[v2] Add math-inline benchmark

Message ID	002c01d0c30a$8f23c600$ad6b5200$@com
State	New
Headers	show Return-Path: <libc-alpha-return-61139-incoming=patchwork.ozlabs.org@sourceware.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:references:in-reply-to:subject:date :message-id:mime-version:content-type; q=dns; s=default; b=wkApa hgdOvJgUvQjJ0apU+y2CQ4id3x/cnnyTgKYiaItHRqj1rG+4fNKLTnJIyEJkrfnZ oladCazrlUf5wBmCrKg77tljfW7/GkmfZPgmM1EGPhXilJobaAg1vNpWAAGgwZ6T ENX7/j7bezo1Tay2txmwVuIYvJtr7pmhlguocU= Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org From: "Wilco Dijkstra" <wdijkstr@arm.com> To: "'Siddhesh Poyarekar'" <siddhesh@redhat.com>, "Carlos O'Donell" <carlos@redhat.com> Cc: "'GNU C Library'" <libc-alpha@sourceware.org> References: <002001d0bfb8$b36fa330$1a4ee990$@com> <55A907F6.8000504@redhat.com> <20150717143406.GG19592@spoyarek.pnq.redhat.com> In-Reply-To: <20150717143406.GG19592@spoyarek.pnq.redhat.com> Subject: RE: [PATCH v2] Add math-inline benchmark Date: Mon, 20 Jul 2015 17:38:51 +0100 Message-ID: <002c01d0c30a$8f23c600$ad6b5200$@com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_002D_01D0C312.F0E82E00"

Wilco July 20, 2015, 4:38 p.m. UTC

> Siddhesh Poyarekar wrote:
> On Fri, Jul 17, 2015 at 09:49:42AM -0400, Carlos O'Donell wrote:
> > Maybe we can place this in another C file e.g. bench-util.c and #include that
> > and then call those functions?
> 
> Yes, that is fine.

I've extracted the start function in bench-util.c, see updated version.

Wilco
---
 benchtests/Makefile             |   7 +-
 benchtests/bench-math-inlines.c | 340 ++++++++++++++++++++++++++++++++++++++++
 benchtests/bench-skeleton.c     |  18 +--
 benchtests/bench-util.c         |  34 ++++
 benchtests/bench-util.h         |  29 ++++
 5 files changed, 412 insertions(+), 16 deletions(-)
 create mode 100644 benchtests/bench-math-inlines.c
 create mode 100644 benchtests/bench-util.c
 create mode 100644 benchtests/bench-util.h

Carlos O'Donell July 20, 2015, 7:23 p.m. UTC | #1

On 07/20/2015 12:38 PM, Wilco Dijkstra wrote:
>> Siddhesh Poyarekar wrote:
>> > On Fri, Jul 17, 2015 at 09:49:42AM -0400, Carlos O'Donell wrote:
>>> > > Maybe we can place this in another C file e.g. bench-util.c and #include that
>>> > > and then call those functions?
>> > 
>> > Yes, that is fine.
> I've extracted the start function in bench-util.c, see updated version.
> 
> Wilco

One nit, OK to commit with that fixed.

See at the very end.

> 0001-Add-bench-math-inlines.txt
> 
> 
> ---
>  benchtests/Makefile             |   7 +-
>  benchtests/bench-math-inlines.c | 340 ++++++++++++++++++++++++++++++++++++++++
>  benchtests/bench-skeleton.c     |  18 +--
>  benchtests/bench-util.c         |  34 ++++
>  benchtests/bench-util.h         |  29 ++++
>  5 files changed, 412 insertions(+), 16 deletions(-)
>  create mode 100644 benchtests/bench-math-inlines.c
>  create mode 100644 benchtests/bench-util.c
>  create mode 100644 benchtests/bench-util.h
> 
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 8e615e5..91970f8 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -36,6 +36,7 @@ string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
>  		strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
>  		strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok \
>  		strcoll
> +
>  string-bench-all := $(string-bench)
>  
>  # We have to generate locales
> @@ -50,7 +51,10 @@ stdlib-bench := strtod
>  
>  stdio-common-bench := sprintf
>  
> -benchset := $(string-bench-all) $(stdlib-bench) $(stdio-common-bench)
> +math-benchset := math-inlines
> +
> +benchset := $(string-bench-all) $(stdlib-bench) $(stdio-common-bench) \
> +	    $(math-benchset)

OK.

>  
>  CFLAGS-bench-ffs.c += -fno-builtin
>  CFLAGS-bench-ffsll.c += -fno-builtin
> @@ -58,6 +62,7 @@ CFLAGS-bench-ffsll.c += -fno-builtin
>  bench-malloc := malloc-thread
>  
>  $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
> +$(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
>  $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
>  $(objpfx)bench-malloc-thread: $(shared-thread-library)
>  
> diff --git a/benchtests/bench-math-inlines.c b/benchtests/bench-math-inlines.c
> new file mode 100644
> index 0000000..da8a433
> --- /dev/null
> +++ b/benchtests/bench-math-inlines.c
> @@ -0,0 +1,340 @@
> +/* Measure math inline functions.

OK.

> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define SIZE 1024
> +#define TEST_MAIN
> +#define TEST_NAME "math-inlines"
> +#define TEST_FUNCTION test_main ()
> +#include "bench-timing.h"
> +#include "json-lib.h"
> +#include "bench-util.h"
> +
> +#include <stdlib.h>
> +#include <math.h>
> +#include <stdint.h>
> +

OK.

> +
> +#define BOOLTEST(func)					  \
> +int							  \
> +func ## _t (volatile double *p, size_t n, size_t iters)   \
> +{							  \
> +  int i, j;						  \
> +  int res = 0;						  \
> +  for (j = 0; j < iters; j++)				  \
> +    for (i = 0; i < n; i++)				  \
> +      { double tmp = p[i] * 2.0;			  \
> +	if (func (tmp)) res += 5; }			  \
> +  return res;						  \
> +}
> +
> +#define VALUETEST(func)					  \
> +int							  \
> +func ## _t (volatile double *p, size_t n, size_t iters)	  \
> +{							  \
> +  int i, j;						  \
> +  int res = 0;						  \
> +  for (j = 0; j < iters; j++)				  \
> +    for (i = 0; i < n; i++)				  \
> +      { double tmp = p[i] * 2.0;			  \
> +	if (func (tmp)) res += 5; }			  \
> +  return res;						  \
> +}
> +
> +typedef union
> +{
> +  double value;
> +  uint64_t word;
> +} ieee_double_shape_type;
> +
> +#define EXTRACT_WORDS64(i,d)				  \
> +do {							  \
> +  ieee_double_shape_type gh_u;				  \
> +  gh_u.value = (d);					  \
> +  (i) = gh_u.word;					  \
> +} while (0)
> +
> +/* Explicit inlines similar to math_private.h versions.  */
> +
> +extern __always_inline int
> +__isnan_inl (double d)
> +{
> +  uint64_t di;
> +  EXTRACT_WORDS64 (di, d);
> +  return (di & 0x7fffffffffffffffull) > 0x7ff0000000000000ull;
> +}
> +
> +extern __always_inline int
> +__isnan_builtin (double d)
> +{
> +  return __builtin_isnan (d);
> +}
> +
> +extern __always_inline int
> +__isinf_inl (double x)
> +{
> +  uint64_t ix;
> +  EXTRACT_WORDS64 (ix,x);
> +  if ((ix << 1) != 0xffe0000000000000ull)
> +    return 0;
> +  return (int)(ix >> 32);
> +}
> +
> +extern __always_inline int
> +__isinf_ns (double d)
> +{
> +  uint64_t di;
> +  EXTRACT_WORDS64 (di, d);
> +  return (di & 0x7fffffffffffffffull) == 0x7ff0000000000000ull;
> +}
> +
> +extern __always_inline int
> +__isinf_ns_builtin (double d)
> +{
> +  return __builtin_isinf (d);
> +}
> +
> +extern __always_inline int
> +__isinf_builtin (double d)
> +{
> +  return __builtin_isinf_sign (d);
> +}
> +
> +
> +extern __always_inline int
> +__finite_inl (double d)
> +{
> +  uint64_t di;
> +  EXTRACT_WORDS64 (di, d);
> +  return (di & 0x7fffffffffffffffull) < 0x7ff0000000000000ull;
> +}
> +
> +extern __always_inline int
> +__isfinite_builtin (double d)
> +{
> +  return __builtin_isfinite (d);
> +}
> +
> +
> +/* Explicit inline similar to existing math.h implementation.  */
> +
> +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)
> +#define __isnormal_inl2(X) (fpclassify (X) == FP_NORMAL)
> +
> +extern __always_inline int
> +__isnormal_builtin (double d)
> +{
> +  return __builtin_isnormal (d);
> +}
> +
> +/* Test fpclassify with use of only 2 of the 5 results.  */
> +
> +extern __always_inline int
> +__fpclassify_test1 (double d)
> +{
> +  int cl = fpclassify (d);
> +  return cl == FP_NAN || cl == FP_INFINITE;
> +}
> +
> +extern __always_inline int
> +__fpclassify_test2 (double d)
> +{
> +  return __builtin_isnan (d) || __builtin_isinf (d);
> +}
> +
> +double __attribute ((noinline))
> +kernel_standard (double x, double y, int z)
> +{
> +  return x * y + z;
> +}
> +
> +double __attribute ((noinline))
> +remainder2 (double x, double y)
> +{
> +  if (((__builtin_expect (y == 0.0, 0) && !__builtin_isnan (x))
> +	|| (__builtin_expect (__builtin_isinf (x), 0) && !__builtin_isnan (y))))
> +    return kernel_standard (x, y, 10);
> +
> +  return remainder (x, y);
> +}
> +
> +double __attribute ((noinline))
> +remainder1 (double x, double y)
> +{
> +  if (((__builtin_expect (y == 0.0, 0) && !__isnan_inl (x))
> +       || (__builtin_expect (__isinf_ns (x), 0) && !__isnan_inl (y))))
> +    return kernel_standard (x, y, 10);
> +
> +  return remainder (x, y);
> +}
> +
> +volatile double rem1 = 2.5;
> +
> +extern __always_inline int
> +remainder_test1 (double d)
> +{
> +  return remainder1 (d, rem1);
> +}
> +
> +extern __always_inline int
> +remainder_test2 (double d)
> +{
> +  return remainder2 (d, rem1);
> +}
> +
> +/* Create test functions for each possibility.  */
> +
> +BOOLTEST (__isnan)
> +BOOLTEST (__isnan_inl)
> +BOOLTEST (__isnan_builtin)
> +BOOLTEST (isnan)
> +
> +BOOLTEST (__isinf)
> +BOOLTEST (__isinf_inl)
> +BOOLTEST (__isinf_ns)
> +BOOLTEST (__isinf_ns_builtin)
> +BOOLTEST (__isinf_builtin)
> +BOOLTEST (isinf)
> +
> +BOOLTEST (__finite)
> +BOOLTEST (__finite_inl)
> +BOOLTEST (__isfinite_builtin)
> +BOOLTEST (isfinite)
> +
> +BOOLTEST (__isnormal_inl)
> +BOOLTEST (__isnormal_inl2)
> +BOOLTEST (__isnormal_builtin)
> +BOOLTEST (isnormal)
> +
> +BOOLTEST (__fpclassify_test1)
> +BOOLTEST (__fpclassify_test2)
> +VALUETEST (__fpclassify)
> +VALUETEST (fpclassify)
> +
> +BOOLTEST (remainder_test1)
> +BOOLTEST (remainder_test2)
> +
> +typedef int (*proto_t) (volatile double *p, size_t n, size_t iters);
> +
> +typedef struct
> +{
> +  const char *name;
> +  proto_t fn;
> +} impl_t;
> +
> +#define IMPL(name) { #name, name }
> +
> +impl_t test_list[] =
> +{
> +  IMPL (__isnan_t),
> +  IMPL (__isnan_inl_t),
> +  IMPL (__isnan_builtin_t),
> +  IMPL (isnan_t),
> +
> +  IMPL (__isinf_t),
> +  IMPL (__isinf_inl_t),
> +  IMPL (__isinf_ns_t),
> +  IMPL (__isinf_ns_builtin_t),
> +  IMPL (__isinf_builtin_t),
> +  IMPL (isinf_t),
> +
> +  IMPL (__finite_t),
> +  IMPL (__finite_inl_t),
> +  IMPL (__isfinite_builtin_t),
> +  IMPL (isfinite_t),
> +
> +  IMPL (__isnormal_inl_t),
> +  IMPL (__isnormal_inl2_t),
> +  IMPL (__isnormal_builtin_t),
> +  IMPL (isnormal_t),
> +
> +  IMPL (__fpclassify_test1_t),
> +  IMPL (__fpclassify_test2_t),
> +  IMPL (__fpclassify_t),
> +  IMPL (fpclassify_t),
> +
> +  IMPL (remainder_test1_t),
> +  IMPL (remainder_test2_t)
> +};
> +
> +static void
> +do_one_test (json_ctx_t *json_ctx, proto_t test_fn, volatile double *arr,
> +	     size_t len, const char *testname)
> +{
> +  size_t iters = 500;
> +  timing_t start, stop, cur;
> +
> +  json_attr_object_begin (json_ctx, testname);
> +
> +  TIMING_NOW (start);
> +  test_fn (arr, len, iters);
> +  TIMING_NOW (stop);
> +  TIMING_DIFF (cur, start, stop);
> +
> +  json_attr_double (json_ctx, "duration", cur);
> +  json_attr_double (json_ctx, "iterations", iters);
> +  json_attr_double (json_ctx, "mean", cur / iters);
> +  json_attr_object_end (json_ctx);
> +}
> +
> +static volatile double arr1[SIZE];
> +static volatile double arr2[SIZE];
> +
> +int
> +test_main (void)
> +{
> +  json_ctx_t json_ctx;
> +  size_t i;
> +
> +  bench_start ();
> +
> +  json_init (&json_ctx, 2, stdout);
> +  json_attr_object_begin (&json_ctx, "math-inlines");
> +
> +  /* Create 2 test arrays, one with 10% zeroes, 10% negative values,
> +     79% positive values and 1% infinity/NaN.  The other contains
> +     50% inf, 50% NaN.  */
> +
> +  for (i = 0; i < SIZE; i++)
> +    {
> +      int x = rand () & 255;
> +      arr1[i] = (x < 25) ? 0.0 : ((x < 50) ? -1 : 100);
> +      if (x == 255) arr1[i] = __builtin_inf ();
> +      if (x == 254) arr1[i] = __builtin_nan ("0");
> +      arr2[i] = (x < 128) ? __builtin_inf () : __builtin_nan ("0");
> +    }
> +
> +  for (i = 0; i < sizeof (test_list) / sizeof (test_list[0]); i++)
> +    {
> +      json_attr_object_begin (&json_ctx, test_list[i].name);
> +      do_one_test (&json_ctx, test_list[i].fn, arr2, SIZE, "inf/nan");
> +      json_attr_object_end (&json_ctx);
> +    }
> +
> +  for (i = 0; i < sizeof (test_list) / sizeof (test_list[0]); i++)
> +    {
> +      json_attr_object_begin (&json_ctx, test_list[i].name);
> +      do_one_test (&json_ctx, test_list[i].fn, arr1, SIZE, "normal");
> +      json_attr_object_end (&json_ctx);
> +    }
> +
> +  json_attr_object_end (&json_ctx);
> +  return 0;
> +}
> +
> +#include "bench-util.c"
> +#include "../test-skeleton.c"
> diff --git a/benchtests/bench-skeleton.c b/benchtests/bench-skeleton.c
> index e357f0c..bc820df 100644
> --- a/benchtests/bench-skeleton.c
> +++ b/benchtests/bench-skeleton.c
> @@ -24,21 +24,9 @@
>  #include <inttypes.h>
>  #include "bench-timing.h"
>  #include "json-lib.h"
> +#include "bench-util.h"
>  
> -volatile unsigned int dontoptimize = 0;
> -
> -void
> -startup (void)
> -{
> -  /* This loop should cause CPU to switch to maximal freqency.
> -     This makes subsequent measurement more accurate.  We need a side effect
> -     to prevent the loop being deleted by compiler.
> -     This should be enough to cause CPU to speed up and it is simpler than
> -     running loop for constant time. This is used when user does not have root
> -     access to set a constant freqency.  */
> -  for (int k = 0; k < 10000000; k++)
> -    dontoptimize += 23 * dontoptimize + 2;
> -}
> +#include "bench-util.c"

OK.

>  
>  #define TIMESPEC_AFTER(a, b) \
>    (((a).tv_sec == (b).tv_sec) ?						      \
> @@ -56,7 +44,7 @@ main (int argc, char **argv)
>    if (argc == 2 && !strcmp (argv[1], "-d"))
>      detailed = true;
>  
> -  startup();
> +  bench_start ();

OK.

>  
>    memset (&runtime, 0, sizeof (runtime));
>  
> diff --git a/benchtests/bench-util.c b/benchtests/bench-util.c
> new file mode 100644
> index 0000000..c4149ae
> --- /dev/null
> +++ b/benchtests/bench-util.c
> @@ -0,0 +1,34 @@
> +/* Benchmark utility functions.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +
> +static volatile unsigned int dontoptimize = 0;
> +
> +void
> +bench_start (void)
> +{
> +  /* This loop should cause CPU to switch to maximal freqency.
> +     This makes subsequent measurement more accurate.  We need a side effect
> +     to prevent the loop being deleted by compiler.
> +     This should be enough to cause CPU to speed up and it is simpler than
> +     running loop for constant time.  This is used when user does not have root
> +     access to set a constant freqency.  */
> +
> +  for (int k = 0; k < START_ITER; k++)
> +    dontoptimize += 23 * dontoptimize + 2;
> +}

OK.

> diff --git a/benchtests/bench-util.h b/benchtests/bench-util.h
> new file mode 100644
> index 0000000..b30b085
> --- /dev/null
> +++ b/benchtests/bench-util.h
> @@ -0,0 +1,29 @@
> +/* Benchmark utility functions.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +
> +#ifndef START_ITER
> +# define START_ITER (100000000)
> +#endif
> +
> +/* bench_start reduces the random variations due to frequency scaling by
> +   executing a small loop with many memory accesses.  START_ITER controls
> +   the number of iterations.  */
> +
> +void
> +bench_start (void);

One line please, only function definitions are split on two lines to make
it easier to find e.g. grep '^bench_start' *. Per GNU Coding Standard.

OK with this change.

> -- 1.9.1

Cheers,
Carlos.

Ondřej Bílka July 21, 2015, 11:35 a.m. UTC | #2

On Mon, Jul 20, 2015 at 03:23:50PM -0400, Carlos O'Donell wrote:
> On 07/20/2015 12:38 PM, Wilco Dijkstra wrote:
> >> Siddhesh Poyarekar wrote:
> >> > On Fri, Jul 17, 2015 at 09:49:42AM -0400, Carlos O'Donell wrote:
> >>> > > Maybe we can place this in another C file e.g. bench-util.c and #include that
> >>> > > and then call those functions?
> >> > 
> >> > Yes, that is fine.
> > I've extracted the start function in bench-util.c, see updated version.
> > 
> > Wilco
> 
> One nit, OK to commit with that fixed.
>
No Carlos, this isn't ok. You need to do better review as this patch has
still some issues. 

I am not talking about problems that this doesn't measure real workload
at all but simple problems that reviewer should point out.

Here its that builtins were added but not well. As this doesn't have
builtin_fpclassify comparing these doesn't make sense.


> > +  int cl = fpclassify (d);
> > +  return cl == FP_NAN || cl == FP_INFINITE;

and

> > +  return __builtin_isnan (d) || __builtin_isinf (d);

Also you don't test inlines in isnormal:

> > +/* Explicit inline similar to existing math.h implementation.  */
> > +
> > +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)

Wilco July 21, 2015, 4:17 p.m. UTC | #3

> Ondřej Bílka wrote:
> On Mon, Jul 20, 2015 at 03:23:50PM -0400, Carlos O'Donell wrote:
> > On 07/20/2015 12:38 PM, Wilco Dijkstra wrote:
> > >> Siddhesh Poyarekar wrote:
> > >> > On Fri, Jul 17, 2015 at 09:49:42AM -0400, Carlos O'Donell wrote:
> > >>> > > Maybe we can place this in another C file e.g. bench-util.c and #include that
> > >>> > > and then call those functions?
> > >> >
> > >> > Yes, that is fine.
> > > I've extracted the start function in bench-util.c, see updated version.
> > >
> > > Wilco
> >
> > One nit, OK to commit with that fixed.
> >
> No Carlos, this isn't ok. You need to do better review as this patch has
> still some issues.

I don't think there is any point in blaming Carlos.

> I am not talking about problems that this doesn't measure real workload
> at all but simple problems that reviewer should point out.
> 
> Here its that builtins were added but not well. As this doesn't have
> builtin_fpclassify comparing these doesn't make sense.
> 
> 
> > > +  int cl = fpclassify (d);
> > > +  return cl == FP_NAN || cl == FP_INFINITE;
> 
> and
> 
> > > +  return __builtin_isnan (d) || __builtin_isinf (d);

Fpclassify uses the builtin, but I think it's better to remove these
tests as they don't add much value. It's pretty obvious it is better
to use separate tests instead of fpclassify when you only need 2 of them.

> Also you don't test inlines in isnormal:
> 
> > > +/* Explicit inline similar to existing math.h implementation.  */
> > > +
> > > +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)

This is an inline and exactly what the current math.h does. The goal of my
benchmark is comparing the existing implementation with the new inlines.

Wilco

Carlos O'Donell July 21, 2015, 4:43 p.m. UTC | #4

On 07/21/2015 12:17 PM, Wilco Dijkstra wrote:
>>>> +  return __builtin_isnan (d) || __builtin_isinf (d);
> 
> Fpclassify uses the builtin, but I think it's better to remove these
> tests as they don't add much value. It's pretty obvious it is better
> to use separate tests instead of fpclassify when you only need 2 of them.

Thanks Wilco, please post a follow up patch.

>> Also you don't test inlines in isnormal:
>>
>>>> +/* Explicit inline similar to existing math.h implementation.  */
>>>> +
>>>> +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)
> 
> This is an inline and exactly what the current math.h does. The goal of my
> benchmark is comparing the existing implementation with the new inlines.

Agreed.

Cheers,
Carlos.

Carlos O'Donell July 21, 2015, 4:47 p.m. UTC | #5

On 07/21/2015 07:35 AM, Ondřej Bílka wrote:
>> One nit, OK to commit with that fixed.
>>
> No Carlos, this isn't ok. You need to do better review as this patch has
> still some issues. 

It is incremental progress for code that has no immediate API or ABI impact.

Therefore I judge it to be OK.

I expect Wilco to improve it incrementally, and he has already agreed to
remove the test that has little value.

Thanks Wilco.

Cheers,
Carlos.

Ondřej Bílka July 21, 2015, 5:04 p.m. UTC | #6

On Tue, Jul 21, 2015 at 05:17:23PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Mon, Jul 20, 2015 at 03:23:50PM -0400, Carlos O'Donell wrote:
> > > On 07/20/2015 12:38 PM, Wilco Dijkstra wrote:
> > > >> Siddhesh Poyarekar wrote:
> > > >> > On Fri, Jul 17, 2015 at 09:49:42AM -0400, Carlos O'Donell wrote:
> > > >>> > > Maybe we can place this in another C file e.g. bench-util.c and #include that
> > > >>> > > and then call those functions?
> > > >> >
> > > >> > Yes, that is fine.
> > > > I've extracted the start function in bench-util.c, see updated version.
> > > >
> > > > Wilco
> > >
> > > One nit, OK to commit with that fixed.
> > >
> > No Carlos, this isn't ok. You need to do better review as this patch has
> > still some issues.
> 
> I don't think there is any point in blaming Carlos.
>
One needs to review patches to fix issues as early as possible. Saying
that something is ok when it isn't doesn't help. Just quickly post v3.

> > I am not talking about problems that this doesn't measure real workload
> > at all but simple problems that reviewer should point out.
> > 
> > Here its that builtins were added but not well. As this doesn't have
> > builtin_fpclassify comparing these doesn't make sense.
> > 
> > 
> > > > +  int cl = fpclassify (d);
> > > > +  return cl == FP_NAN || cl == FP_INFINITE;
> > 
> > and
> > 
> > > > +  return __builtin_isnan (d) || __builtin_isinf (d);
> 
> Fpclassify uses the builtin, but I think it's better to remove these
> tests as they don't add much value. It's pretty obvious it is better
> to use separate tests instead of fpclassify when you only need 2 of them.
> 
I agree that its better to remove them.

However it isn't obvious that separate tests are better as its wrong. It
should be same code as gcc should optimize that to better but often
isnt. When I looked to assembly of my benchmark of noinline tests then
performance difference was caused just by gcc bug. In fpclassify case it
increased and decreased stack pointer without writing anything to stack.

And its easy to find cases where opposite is true as this depends how
you write benchmark, for example if I change my benchmark to use following:

extern __always_inline int
__fpclassify_test1 (double d)
{
  int cl = fpclassify (d);
  return (cl == FP_INFINITE) ? 3 * sin (d) :
         ((cl == FP_NAN) ? 2 * cos (d) : 0) ;
}

extern __always_inline int
__fpclassify_test2 (double d)
{
  return (__builtin_isinf(d)) ? 3 * sin (d) :
         ((__builtin_isnan(d)) ? 2 * cos (d) : 0) ;

}

then fpclassify becomes lot faster.

   "__fpclassify_test1_t": {
    "normal": {
     "duration": 5.91344e+06,
     "iterations": 500,
     "mean": 11826
    }
   },
   "__fpclassify_test2_t": {
    "normal": {
     "duration": 7.45868e+06,
     "iterations": 500,
     "mean": 14917
    }
   },



> > Also you don't test inlines in isnormal:
> > 
> > > > +/* Explicit inline similar to existing math.h implementation.  */
> > > > +
> > > > +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)
> 
> This is an inline and exactly what the current math.h does. The goal of my
> benchmark is comparing the existing implementation with the new inlines.
> 
Then yqu should compare existing implementation with new inlines. Now
both would just call fpclassify without your patch.

Ondřej Bílka July 21, 2015, 5:35 p.m. UTC | #7

On Tue, Jul 21, 2015 at 12:47:05PM -0400, Carlos O'Donell wrote:
> On 07/21/2015 07:35 AM, Ondřej Bílka wrote:
> >> One nit, OK to commit with that fixed.
> >>
> > No Carlos, this isn't ok. You need to do better review as this patch has
> > still some issues. 
> 
> It is incremental progress for code that has no immediate API or ABI impact.
> 
> Therefore I judge it to be OK.
> 
> I expect Wilco to improve it incrementally, and he has already agreed to
> remove the test that has little value.
> 
Then you should say it so. There is additional problem that it doesn't
measure isnormal inline at all so somebody could pick that and find that
builtin doesn't improve isnormal just because of typo here. This is
relatively harmless here but don't have to be.

Wilco July 21, 2015, 5:46 p.m. UTC | #8

> Ondřej Bílka wrote:
> On Tue, Jul 21, 2015 at 05:17:23PM +0100, Wilco Dijkstra wrote:

> > Fpclassify uses the builtin, but I think it's better to remove these
> > tests as they don't add much value. It's pretty obvious it is better
> > to use separate tests instead of fpclassify when you only need 2 of them.
> >
> I agree that its better to remove them.
> 
> However it isn't obvious that separate tests are better as its wrong. It
> should be same code as gcc should optimize that to better but often
> isnt. When I looked to assembly of my benchmark of noinline tests then
> performance difference was caused just by gcc bug. In fpclassify case it
> increased and decreased stack pointer without writing anything to stack.

Sure optimization makes a difference to these benchmarks, that's unavoidable,
so that's why I always carefully check the generated assembly to ensure I
really measure what I wanted to measure.

> And its easy to find cases where opposite is true as this depends how
> you write benchmark, for example if I change my benchmark to use following:
> 
> extern __always_inline int
> __fpclassify_test1 (double d)
> {
>   int cl = fpclassify (d);
>   return (cl == FP_INFINITE) ? 3 * sin (d) :
>          ((cl == FP_NAN) ? 2 * cos (d) : 0) ;
> }
> 
> extern __always_inline int
> __fpclassify_test2 (double d)
> {
>   return (__builtin_isinf(d)) ? 3 * sin (d) :
>          ((__builtin_isnan(d)) ? 2 * cos (d) : 0) ;
> 
> }
> 
> then fpclassify becomes lot faster.
> 
>    "__fpclassify_test1_t": {
>     "normal": {
>      "duration": 5.91344e+06,
>      "iterations": 500,
>      "mean": 11826
>     }
>    },
>    "__fpclassify_test2_t": {
>     "normal": {
>      "duration": 7.45868e+06,
>      "iterations": 500,
>      "mean": 14917
>     }
>    },

Well that doesn't seem correct. Here are my results for the above
using latest GCC and modern x64 hardware:

   "__fpclassify_test1_t": {
    "normal": {
     "duration": 1.99627e+06,
     "iterations": 500,
     "mean": 3992
    }
   },
   "__fpclassify_test2_t": {
    "normal": {
     "duration": 1.9893e+06,
     "iterations": 500,
     "mean": 3978
    }
   },

> > > Also you don't test inlines in isnormal:
> > >
> > > > > +/* Explicit inline similar to existing math.h implementation.  */
> > > > > +
> > > > > +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)
> >
> > This is an inline and exactly what the current math.h does. The goal of my
> > benchmark is comparing the existing implementation with the new inlines.
> >
> Then yqu should compare existing implementation with new inlines. Now
> both would just call fpclassify without your patch.

OK, I'll add a builtin version for fpclassify as well (and use that in
__isnormal_inl2) - then all tests have the non-inline version, an internal 
GLIBC inline if it exists, the builtin and the math.h version.

Wilco

Ondřej Bílka July 21, 2015, 6:12 p.m. UTC | #9

On Tue, Jul 21, 2015 at 06:46:38PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Tue, Jul 21, 2015 at 05:17:23PM +0100, Wilco Dijkstra wrote:
> 
> > > Fpclassify uses the builtin, but I think it's better to remove these
> > > tests as they don't add much value. It's pretty obvious it is better
> > > to use separate tests instead of fpclassify when you only need 2 of them.
> > >
> > I agree that its better to remove them.
> > 
> > However it isn't obvious that separate tests are better as its wrong. It
> > should be same code as gcc should optimize that to better but often
> > isnt. When I looked to assembly of my benchmark of noinline tests then
> > performance difference was caused just by gcc bug. In fpclassify case it
> > increased and decreased stack pointer without writing anything to stack.
> 
> Sure optimization makes a difference to these benchmarks, that's unavoidable,
> so that's why I always carefully check the generated assembly to ensure I
> really measure what I wanted to measure.
> 
> > And its easy to find cases where opposite is true as this depends how
> > you write benchmark, for example if I change my benchmark to use following:
> > 
> > extern __always_inline int
> > __fpclassify_test1 (double d)
> > {
> >   int cl = fpclassify (d);
> >   return (cl == FP_INFINITE) ? 3 * sin (d) :
> >          ((cl == FP_NAN) ? 2 * cos (d) : 0) ;
> > }
> > 
> > extern __always_inline int
> > __fpclassify_test2 (double d)
> > {
> >   return (__builtin_isinf(d)) ? 3 * sin (d) :
> >          ((__builtin_isnan(d)) ? 2 * cos (d) : 0) ;
> > 
> > }
> > 
> > then fpclassify becomes lot faster.
> > 
> >    "__fpclassify_test1_t": {
> >     "normal": {
> >      "duration": 5.91344e+06,
> >      "iterations": 500,
> >      "mean": 11826
> >     }
> >    },
> >    "__fpclassify_test2_t": {
> >     "normal": {
> >      "duration": 7.45868e+06,
> >      "iterations": 500,
> >      "mean": 14917
> >     }
> >    },
> 
> Well that doesn't seem correct. Here are my results for the above
> using latest GCC and modern x64 hardware:
>
Thats what your benchmark produces. Again difference between mine
benchmarks and yours is in how you handled constants loads by forbidding
inlining. My benchmark used following:

#define BOOLTEST(func)                                    \
int __attribute__((noinline)) func## _t_t (double d,int a)        \
{                                                         \
  if (func(d))                                            \
    return 3 * sin (d) + a;                                       \
  else                                                    \
    return 2;                                             \
}                                                         \
int                                                       \
func ## _t (volatile double *p, size_t n, size_t iters)   \
{                                                         \
  int i, j;                                               \
  int res = 0;                                            \
  for (j = 0; j < iters; j++)                             \
    for (i = 0; i < n; i++)                               \
      res += func## _t_t (2.0 * p[i], i);                         \
  return res;                                             \
}

 
>    "__fpclassify_test1_t": {
>     "normal": {
>      "duration": 1.99627e+06,
>      "iterations": 500,
>      "mean": 3992
>     }
>    },
>    "__fpclassify_test2_t": {
>     "normal": {
>      "duration": 1.9893e+06,
>      "iterations": 500,
>      "mean": 3978
>     }
>    },
> 
> > > > Also you don't test inlines in isnormal:
> > > >
> > > > > > +/* Explicit inline similar to existing math.h implementation.  */
> > > > > > +
> > > > > > +#define __isnormal_inl(X) (__fpclassify (X) == FP_NORMAL)
> > >
> > > This is an inline and exactly what the current math.h does. The goal of my
> > > benchmark is comparing the existing implementation with the new inlines.
> > >
> > Then yqu should compare existing implementation with new inlines. Now
> > both would just call fpclassify without your patch.
> 
> OK, I'll add a builtin version for fpclassify as well (and use that in
> __isnormal_inl2) - then all tests have the non-inline version, an internal 
> GLIBC inline if it exists, the builtin and the math.h version.
>

Carlos O'Donell July 21, 2015, 7:11 p.m. UTC | #10

On 07/21/2015 01:35 PM, Ondřej Bílka wrote:
> On Tue, Jul 21, 2015 at 12:47:05PM -0400, Carlos O'Donell wrote:
>> On 07/21/2015 07:35 AM, Ondřej Bílka wrote:
>>>> One nit, OK to commit with that fixed.
>>>>
>>> No Carlos, this isn't ok. You need to do better review as this patch has
>>> still some issues. 
>>
>> It is incremental progress for code that has no immediate API or ABI impact.
>>
>> Therefore I judge it to be OK.
>>
>> I expect Wilco to improve it incrementally, and he has already agreed to
>> remove the test that has little value.
>>
> Then you should say it so. There is additional problem that it doesn't
> measure isnormal inline at all so somebody could pick that and find that
> builtin doesn't improve isnormal just because of typo here. This is
> relatively harmless here but don't have to be.

My apologies, I will endeavour to be more clear next time I give consensus
for a patch.

If I or Wilco missed something the next step is to respond to the original
email with a detailed response about what is missing. Wilco has responded
to the question about isnormal, please respond to his answer and take it
from there to see if we need to change the test.

If at all possible please provide actionable patches that show how you
think the implementation should be.

Cheers,
Carlos.

Ondřej Bílka July 23, 2015, 3:19 p.m. UTC | #11

On Tue, Jul 21, 2015 at 03:11:00PM -0400, Carlos O'Donell wrote:
> On 07/21/2015 01:35 PM, Ondřej Bílka wrote:
> > On Tue, Jul 21, 2015 at 12:47:05PM -0400, Carlos O'Donell wrote:
> >> On 07/21/2015 07:35 AM, Ondřej Bílka wrote:
> >>>> One nit, OK to commit with that fixed.
> >>>>
> >>> No Carlos, this isn't ok. You need to do better review as this patch has
> >>> still some issues. 
> >>
> >> It is incremental progress for code that has no immediate API or ABI impact.
> >>
> >> Therefore I judge it to be OK.
> >>
> >> I expect Wilco to improve it incrementally, and he has already agreed to
> >> remove the test that has little value.
> >>
> > Then you should say it so. There is additional problem that it doesn't
> > measure isnormal inline at all so somebody could pick that and find that
> > builtin doesn't improve isnormal just because of typo here. This is
> > relatively harmless here but don't have to be.
> 
> My apologies, I will endeavour to be more clear next time I give consensus
> for a patch.
> 
> If I or Wilco missed something the next step is to respond to the original
> email with a detailed response about what is missing. Wilco has responded
> to the question about isnormal, please respond to his answer and take it
> from there to see if we need to change the test.
> 
I just asked you to read patch in more critical way so you won't miss
that. 

> If at all possible please provide actionable patches that show how you
> think the implementation should be.
> 
I already send that previously in thread, read my benchmark.

Carlos O'Donell July 23, 2015, 3:26 p.m. UTC | #12

On 07/23/2015 11:19 AM, Ondřej Bílka wrote:
>> If at all possible please provide actionable patches that show how you
>> think the implementation should be.
>>
> I already send that previously in thread, read my benchmark.

This is a long thread. Please help by providing a summary URL or reference
directly to the full patch you show, that either applies on top of Wilco's
work or replaces it as a complete math-inline benchmark.

Cheers,
Carlos.

Ondřej Bílka July 23, 2015, 9:43 p.m. UTC | #13

On Thu, Jul 23, 2015 at 11:26:49AM -0400, Carlos O'Donell wrote:
> On 07/23/2015 11:19 AM, Ondřej Bílka wrote:
> >> If at all possible please provide actionable patches that show how you
> >> think the implementation should be.
> >>
> > I already send that previously in thread, read my benchmark.
> 
> This is a long thread. Please help by providing a summary URL or reference
> directly to the full patch you show, that either applies on top of Wilco's
> work or replaces it as a complete math-inline benchmark.
> 

There are two sets of benchmarks, for simplicity I first made this
standalone to show effects of inlining.

https://sourceware.org/ml/libc-alpha/2015-07/msg00343.html

Then I expanded my benchmark for following.

http://permalink.gmane.org/gmane.comp.lib.glibc.alpha/53314

These clearly show that there is no consensus about benchmarks yet as
these benchmarks give different results than Wilco and it isn't clear 
what results are real and what are caused by bugs in gcc or selecting
specific workload.

[v2] Add math-inline benchmark

Commit Message

Comments

Patch