On Mon, 15 Jun 2015, Wilco Dijkstra wrote:

> Add inlining of the C99 math functions
> isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins
> when available. Since going through the PLT is expensive for these small
> functions, inlining results in major speedups (about 7x on Cortex-A57
> for isinf). The GCC built-ins are not correct if signalling NaN support

Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.

What are the effects on code size (especially for fpclassify)?  If code becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be considered.

> As a result of this many target overrides and the various
> __isnan/__finite inlines in math_private.h are no longer required. If
> agreed we could remove all this code and only keep the generic
> definition of isinf/etc which will use the builtin.

How do those inlines compare with the code GCC generates?  As I understand it, before your patch libm-internal calls to these macros would have called the inline versions of functions such as __finite (no function call, PLT or otherwise, and using integer arithmetic), and after the patch libm-internal calls would have used the GCC inlines (with floating-point arithmetic) - is the new state for libm-internal calls better than the old state or not?

> 2015-06-15  Wilco Dijkstra  <wdijkstr@arm.com>
>	[BZ #15367] [BZ #17441]
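To illustrate the distinction Joseph is drawing between the two styles, here is a minimal sketch; the function names are illustrative and this is neither the exact math_private.h inline nor the exact GCC expansion:

    #include <stdint.h>
    #include <string.h>

    /* Integer-arithmetic style, in the spirit of the old math_private.h
       inlines: inspect the exponent bits directly.  */
    static inline int finite_int_style (double x)
    {
      uint64_t u;
      memcpy (&u, &x, sizeof u);                 /* bit pattern of x */
      return ((u >> 52) & 0x7ff) != 0x7ff;       /* exponent not all-ones */
    }

    /* Floating-point style, roughly what __builtin_isfinite can expand to:
       compare |x| against the largest finite double (DBL_MAX).  */
    static inline int finite_fp_style (double x)
    {
      return __builtin_fabs (x) <= 0x1.fffffffffffffp1023;
    }

The first form needs an FP-to-integer move and large integer constants; the second keeps the value in FP registers, which is the performance trade-off discussed below.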
> Joseph Myers wrote:
> On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
>
> > Add inlining of the C99 math functions
> > isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins
> > when available. Since going through the PLT is expensive for these small
> > functions, inlining results in major speedups (about 7x on Cortex-A57
> > for isinf). The GCC built-ins are not correct if signalling NaN support
>
> Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.

I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...

> What are the effects on code size (especially for fpclassify)?  If code becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be considered.

Codesize of what?  Few applications use these functions... GLIBC mathlib is the main example, but it's not typical: it uses these functions extremely heavily, and you're unlikely to want to avoid inlining there, as the performance gain and the avoidance of explicit calls (no need to use callee-saves) are significant benefits in the math functions.

I suppose adding a size check for fpclassify would be reasonable, as the expansion ends up quite large (but that should really be fixed).  Note fpclassify is used rather inappropriately in most cases in GLIBC, so the best approach would be to rewrite those uses to use faster isnan/isinf checks.

> > As a result of this many target overrides and the various
> > __isnan/__finite inlines in math_private.h are no longer required. If
> > agreed we could remove all this code and only keep the generic
> > definition of isinf/etc which will use the builtin.
>
> How do those inlines compare with the code GCC generates?  As I understand it, before your patch libm-internal calls to these macros would have called the inline versions of functions such as __finite (no function call, PLT or otherwise, and using integer arithmetic), and after the patch libm-internal calls would have used the GCC inlines (with floating-point arithmetic) - is the new state for libm-internal calls better than the old state or not?

__isinf(f/l) were never inlined, but are now (__isinf_ns was used in some cases, but not nearly all).  Fpclassify*, isnormal* were never inlined, but now are.  These new inlines provide the gains.

__isnan(f) and __finite(f) used GLIBC-internal inlines, and now use GCC built-ins instead.  We could measure whether this specific change causes any difference, but I bet it will be very uarch-specific and minor compared to the total time the math functions take.

Also I don't think having special inlines that are only used inside GLIBC is a good approach - if the GCC built-ins are not fast enough then we should fix them.

Wilco
On Mon, 15 Jun 2015, Wilco Dijkstra wrote:

> > Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.
>
> I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...

Well, the benchmark should come first....

> > What are the effects on code size (especially for fpclassify)?  If code becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be considered.
>
> Codesize of what?  Few applications use these functions... GLIBC mathlib is

Size of any code calling these macros (for nonconstant arguments).

> Also I don't think having special inlines that are only used inside GLIBC is a good approach - if the GCC built-ins are not fast enough then we should fix them.

Yes, we should improve the built-in functions, but first we should understand the effects on performance of glibc libm functions (I don't know if the existing benchtests cover cases where the __finite / __isnan / __isinf_ns inlines were used) to see if this cleanup patch indeed doesn't significantly harm performance of affected libm functions (and possibly improves performance through the changes in cases that wouldn't previously have been inlined at all).
On Mon, Jun 15, 2015 at 05:40:11PM +0100, Wilco Dijkstra wrote:
> > Joseph Myers wrote:
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> >
> > > Add inlining of the C99 math functions
> > > isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins
> > > when available. Since going through the PLT is expensive for these small
> > > functions, inlining results in major speedups (about 7x on Cortex-A57
> > > for isinf). The GCC built-ins are not correct if signalling NaN support
> >
> > Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.
>
> I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...
>
As I wrote in the other thread where I showed that the gcc builtins have poor performance, a benchmark is tricky.  The main problem is that these tests sit inside a branch and gcc will simplify them; as the builtins are branchless it obviously couldn't simplify those.

You could reuse my example to show the problem with the builtins as a benchtest - you could make worse mistakes and inlining would still save the day.  You get different characteristics in the following three cases.  In case 1 isinf is simplified by the surrounding if, so we measure the time of a branch that is never taken.  Case 2 is identical, except that in case 1 a conversion to the branchless form ret += 42 * __builtin_isinf(d[i]) might make sense, where gcc could make isinf return 1 when its argument is infinity.  In case 3 you get a raw sum; the difference between the builtin and the optimized inline is bigger, mainly because of the dependency chain.

The numbers you get are also architecture specific, so it's hard to guess exact figures beyond saying that inlining gives a huge boost.  For the other functions performance should be identical to an optimized inline, as it just checks that the exponent is 1023 and fails; what happens after that check is irrelevant.

On my Ivy Bridge I get the following, where bti is the optimized inline for case i and nti is the gcc builtin for the same case.  Performance becomes different when you isolate the builtin and the inline expansion into separately compiled functions, as the loops gain a lot by not having to initialize constants.  So you first need to say what your benchmark characteristics are before writing the benchmark.  A sample follows.

$ time ./bt1
real    0m0.788s
user    0m0.772s
sys     0m0.000s
19:44:18:~$ time ./nt1
real    0m1.162s
user    0m1.136s
sys     0m0.004s
19:44:23:~$ time ./bt2
real    0m0.777s
user    0m0.760s
sys     0m0.000s
19:44:28:~$ time ./nt2
real    0m1.167s
user    0m1.144s
sys     0m0.000s
19:44:32:~$ time ./bt3
real    0m1.161s
user    0m1.140s
sys     0m0.000s
19:44:37:~$ time ./nt3
real    0m1.774s
user    0m1.740s
sys     0m0.000s

#include <stdlib.h>

#ifdef BRANCHED
/* Integer-arithmetic isinf: 2*x drops the sign bit; compare against the
   doubled infinity bit pattern, then check the sign.  */
static inline int isinf (double dx)
{
  union u { double d; long l; };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 : -1) : 0;
}
#endif

int main ()
{
  double ret = 0.0;
  int i, j;
  double *d = malloc (800000);
  for (j = 0; j < 1000000; j++)
#ifdef T1
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf (d[i]), 0))
        ret += 42;
#endif
#ifdef T2
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf (d[i]), 0))
        abort ();
#endif
#ifdef T3
    for (i = 0; i < 1000; i++)
      ret += __builtin_expect (isinf (d[i]), 0);
#endif
  return ret;
}

> > What are the effects on code size (especially for fpclassify)?  If code becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be considered.
>
> Codesize of what?  Few applications use these functions... GLIBC mathlib is the main example, but it's not typical: it uses these functions extremely heavily, and you're unlikely to want to avoid inlining there, as the performance gain and the avoidance of explicit calls (no need to use callee-saves) are significant benefits in the math functions.
>
> I suppose adding a size check for fpclassify would be reasonable, as the expansion ends up quite large (but that should really be fixed).  Note fpclassify is used rather inappropriately in most cases in GLIBC, so the best approach would be to rewrite those uses to use faster isnan/isinf checks.
>
Remember the cost of cache: you are only really optimizing for performance when you assume that instruction cache misses are a problem.  It is worth optimizing even at the cost of increased size; just be aware of the instruction cost.  A rule of thumb is that an extra byte should save a cycle.

fpclassify should have about the same cache-usage impact as isinf.  Only the check of the exponent would ever be in cache; gcc should move the rest of the function to a separate cache line with cold code that is almost never read and obviously has no impact on performance.  My guess is that it would also improve size in most cases, but it depends on the caller.

Also you shouldn't rewrite those uses to isnan/isinf checks, as that would probably harm performance.  Again the reason is that gcc deletes unused branches, so the function size depends on what you check and is smaller than separate checks.  Using your builtin on x64 you get the following assembly for

    int result = __builtin_fpclassify (FP_NAN, FP_INFINITE, \
                                       FP_NORMAL, FP_SUBNORMAL, FP_ZERO, d[i]);

followed by if (result == FP_NORMAL):

  48:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  4c:	66 0f 54 c1          	andpd  %xmm1,%xmm0
  50:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  54:	7a 10                	jp     66 <main+0x66>
  56:	66 0f 2e c2          	ucomisd %xmm2,%xmm0
  5a:	77 0a                	ja     66 <main+0x66>
  5c:	66 0f 2e c4          	ucomisd %xmm4,%xmm0
  60:	72 04                	jb     66 <main+0x66>
  62:	f2 0f 58 dd          	addsd  %xmm5,%xmm3

or if (result == FP_NAN):

  41:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  45:	66 0f 54 c1          	andpd  %xmm1,%xmm0
  49:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  4d:	7b e9                	jnp    38 <main+0x38>

or if (result == FP_INFINITE || result == FP_ZERO):

  48:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  4c:	66 0f 54 c2          	andpd  %xmm2,%xmm0
  50:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  54:	7a 06                	jp     5c <main+0x5c>
  56:	66 0f 2e c3          	ucomisd %xmm3,%xmm0
  5a:	76 04                	jbe    60 <main+0x60>

Note that FP_NORMAL has poor performance and you should replace it with an inline, as none of the three branches there is taken, while what I submitted in the review thread has one branch that is not taken.

> > > As a result of this many target overrides and the various
> > > __isnan/__finite inlines in math_private.h are no longer required. If
> > > agreed we could remove all this code and only keep the generic
> > > definition of isinf/etc which will use the builtin.
> >
> > How do those inlines compare with the code GCC generates?  As I understand it, before your patch libm-internal calls to these macros would have called the inline versions of functions such as __finite (no function call, PLT or otherwise, and using integer arithmetic), and after the patch libm-internal calls would have used the GCC inlines (with floating-point arithmetic) - is the new state for libm-internal calls better than the old state or not?
>
> __isinf(f/l) were never inlined, but are now (__isinf_ns was used in some cases, but not nearly all).  Fpclassify*, isnormal* were never inlined, but now are.  These new inlines provide the gains.
>
> __isnan(f) and __finite(f) used GLIBC-internal inlines, and now use GCC built-ins instead.  We could measure whether this specific change causes any difference, but I bet it will be very uarch-specific and minor compared to the total time the math functions take.
>
> Also I don't think having special inlines that are only used inside GLIBC is a good approach - if the GCC built-ins are not fast enough then we should fix them.
>
On Mon, 15 Jun 2015, Ondřej Bílka wrote:

> As I wrote in the other thread where I showed that the gcc builtins have poor performance, a benchmark is tricky.  The main problem is that these tests sit inside a branch and gcc will simplify them; as the builtins are branchless it obviously couldn't simplify those.

Even a poor benchmark, checked into the benchtests directory, would be a starting point for improved benchmarks as well as for benchmarking any future improvements to these functions.  Having a benchmark that everyone can readily use with glibc is better than having a performance improvement asserted in a patch submission without the benchmark being available at all.

It isn't necessary to show that the use of built-in functions here is optimal.  Simply provide evidence that (a) it's at least as good as the existing out-of-line functions, for calls from user programs, and (b) libm functions that previously used glibc-internal inlines, and would use GCC built-in functions after the patch, don't suffer any significant performance regression from that change.
On Mon, Jun 15, 2015 at 09:35:22PM +0000, Joseph Myers wrote:
> On Mon, 15 Jun 2015, Ondřej Bílka wrote:
>
> > As I wrote in the other thread where I showed that the gcc builtins have poor performance, a benchmark is tricky.  The main problem is that these tests sit inside a branch and gcc will simplify them; as the builtins are branchless it obviously couldn't simplify those.
>
> Even a poor benchmark, checked into the benchtests directory, would be a starting point for improved benchmarks as well as for benchmarking any future improvements to these functions.  Having a benchmark that everyone can readily use with glibc is better than having a performance improvement asserted in a patch submission without the benchmark being available at all.
>
No, a poor benchmark is dangerous and much worse than none at all.  With a poor benchmark you could easily check in a performance regression that looks like an improvement on the benchmark, and you wouldn't notice until some developer measures poor performance in his application and finds where the problem lies.

I could almost "show" that the fpclassify gcc builtin is slower than a library call: in the benchmark below I exploit branch misprediction to get close.  If I could use the "benchtest" below, I could "improve" fpclassify by making the zero check branchless, which would improve the benchmark numbers enough to actually beat the call overhead.  Or I could play with the probability of subnormals to increase the running time of the gcc builtin and decrease that of the library call.  The moral is that with a poor benchmark your implementation will be poor, as it ends up minimizing the benchmark.

With d[i] = (rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0); there is almost no difference:

06:28:15:~$ gcc -O3 ft.c -lm
06:28:19:~$ time ./a.out
real    0m0.719s
user    0m0.712s
sys     0m0.004s
06:28:26:~$ gcc -O3 ft.c -lm -DBUILTIN
06:28:39:~$ time ./a.out
real    0m0.677s
user    0m0.676s
sys     0m0.000s

While when I change that to d[i] = 4.2 there is a big improvement:

06:50:55:~$ gcc -O3 ft.c -lm
06:51:00:~$ time ./a.out
real    0m0.624s
user    0m0.624s
sys     0m0.000s
06:51:02:~$ gcc -O3 ft.c -lm -DBUILTIN
06:51:12:~$ time ./a.out
real    0m0.384s
user    0m0.380s
sys     0m0.000s

> It isn't necessary to show that the use of built-in functions here is optimal.  Simply provide evidence that (a) it's at least as good as the existing out-of-line functions, for calls from user programs, and (b) libm functions that previously used glibc-internal inlines, and would use GCC built-in functions after the patch, don't suffer any significant performance regression from that change.
>
Which, as I wrote before, isn't the case, due to the poor implementation of the gcc builtins.  You would need to fix that, or add inlines, for optimal performance.

And a variant of my "benchmark" is here:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#ifdef BUILTIN
int __attribute__ ((noinline)) nor (double x)
{
  return __builtin_expect (__builtin_fpclassify (FP_NAN, FP_INFINITE, \
    FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x), 0);
}
#else
int __attribute__ ((noinline)) nor (double x)
{
  return fpclassify (x);
}
#endif

int main ()
{
  double ret = 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i = 0; i < 1000; i++)
    d[i] = (rand () % 2) ? 0.0 : (rand () % 2 ? 1.3 : 1.0/0.0);
  for (j = 0; j < 100000; j++)
    for (i = 0; i < 1000; i++)
      {
        int result = nor (d[i]);
        if (result == FP_NORMAL)
          ret += 42;
      }
  return ret;
}
On 16-06-2015 02:00, Ondřej Bílka wrote:
> On Mon, Jun 15, 2015 at 09:35:22PM +0000, Joseph Myers wrote:
>> On Mon, 15 Jun 2015, Ondřej Bílka wrote:
>>
>>> As I wrote in the other thread where I showed that the gcc builtins have poor performance, a benchmark is tricky.  The main problem is that these tests sit inside a branch and gcc will simplify them; as the builtins are branchless it obviously couldn't simplify those.
>>
>> Even a poor benchmark, checked into the benchtests directory, would be a starting point for improved benchmarks as well as for benchmarking any future improvements to these functions.  Having a benchmark that everyone can readily use with glibc is better than having a performance improvement asserted in a patch submission without the benchmark being available at all.
>>
> No, a poor benchmark is dangerous and much worse than none at all.  With a poor benchmark you could easily check in a performance regression that looks like an improvement on the benchmark, and you wouldn't notice until some developer measures poor performance in his application and finds where the problem lies.
>
> I could almost "show" that the fpclassify gcc builtin is slower than a library call: in the benchmark below I exploit branch misprediction to get close.  If I could use the "benchtest" below, I could "improve" fpclassify by making the zero check branchless, which would improve the benchmark numbers enough to actually beat the call overhead.  Or I could play with the probability of subnormals to increase the running time of the gcc builtin and decrease that of the library call.  The moral is that with a poor benchmark your implementation will be poor, as it ends up minimizing the benchmark.

So, to move this proposal forward, how exactly do you propose to create a benchtest for such a scenario?  I get that this is tricky and a lot of variables may apply, but I do agree with Joseph that we shouldn't necessarily aim for optimal performance; imho using compiler builtins with reasonable performance is a gain in code maintainability.

So from the various code pieces you have thrown into the mailing list, I see that we may focus on a benchmark that uses a random sample with different probability scenarios for the FP number classes:

 1. high probability of normal
 2. high probability of NaN
 3. high probability of infinity
 4. high probability of subnormal

And 2, 3 and 4 should not be the optimization focus (since they are not the usual case for most computations and algorithms).  Do you propose something different?
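As a concrete illustration of the kind of input generation being proposed here, a small sketch follows; the function name, probabilities and values are made up for illustration and are not part of any actual benchtests interface:

    #include <stdlib.h>
    #include <math.h>

    /* Fill BUF with N doubles where roughly FRAC_SPECIAL of the values are
       the given "special" value (NaN/Inf/subnormal/zero) and the rest are
       ordinary normal numbers.  */
    void
    fill_inputs (double *buf, size_t n, double frac_special, double special)
    {
      for (size_t i = 0; i < n; i++)
        buf[i] = ((double) rand () / RAND_MAX < frac_special) ? special : 1.25;
    }

    /* Example scenarios along the lines discussed above:
       fill_inputs (buf, 1024, 0.01, NAN);        -- mostly normal, ~1% NaN
       fill_inputs (buf, 1024, 0.50, INFINITY);   -- half infinities
       fill_inputs (buf, 1024, 0.50, 0x1p-1050);  -- half subnormals
       fill_inputs (buf, 1024, 0.50, 0.0);        -- half zeroes  */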
On Tue, Jun 16, 2015 at 09:31:02AM -0300, Adhemerval Zanella wrote:
>
> On 16-06-2015 02:00, Ondřej Bílka wrote:
> > On Mon, Jun 15, 2015 at 09:35:22PM +0000, Joseph Myers wrote:
> >> On Mon, 15 Jun 2015, Ondřej Bílka wrote:
> >>
> >>> As I wrote in the other thread where I showed that the gcc builtins have poor performance, a benchmark is tricky.  The main problem is that these tests sit inside a branch and gcc will simplify them; as the builtins are branchless it obviously couldn't simplify those.
> >>
> >> Even a poor benchmark, checked into the benchtests directory, would be a starting point for improved benchmarks as well as for benchmarking any future improvements to these functions.  Having a benchmark that everyone can readily use with glibc is better than having a performance improvement asserted in a patch submission without the benchmark being available at all.
> >>
> > No, a poor benchmark is dangerous and much worse than none at all.  With a poor benchmark you could easily check in a performance regression that looks like an improvement on the benchmark, and you wouldn't notice until some developer measures poor performance in his application and finds where the problem lies.
> >
> > I could almost "show" that the fpclassify gcc builtin is slower than a library call: in the benchmark below I exploit branch misprediction to get close.  If I could use the "benchtest" below, I could "improve" fpclassify by making the zero check branchless, which would improve the benchmark numbers enough to actually beat the call overhead.  Or I could play with the probability of subnormals to increase the running time of the gcc builtin and decrease that of the library call.  The moral is that with a poor benchmark your implementation will be poor, as it ends up minimizing the benchmark.
>
> So, to move this proposal forward, how exactly do you propose to create a benchtest for such a scenario?  I get that this is tricky and a lot of variables may apply, but I do agree with Joseph that we shouldn't necessarily aim for optimal performance; imho using compiler builtins with reasonable performance is a gain in code maintainability.
>
As I said before, these are hard to measure, and I could argue against my own benchmark too - it's inaccurate because it doesn't measure the effect of the cpu pipeline when the function does other computation.  The answer is: don't do a microbenchmark.  Take an existing math function that has a benchtest and see what performance difference you gain.  Most complex math functions start with tests like if (isfinite(x) && isfinite(y)), so there should be a measurable performance improvement; possibly add such functions if they aren't covered.

> So from the various code pieces you have thrown into the mailing list, I see that we may focus on a benchmark that uses a random sample with different probability scenarios for the FP number classes:
>
>  1. high probability of normal
>  2. high probability of NaN
>  3. high probability of infinity
>  4. high probability of subnormal
>
> And 2, 3 and 4 should not be the optimization focus (since they are not the usual case for most computations and algorithms).  Do you propose something different?

No, I did that to show that microbenchmarks tend to be wrong; you need to do real benchmarking to rule that out.  Depending on the distribution of these cases a different implementation becomes optimal, and with some work I could find a distribution where the cost of the gcc builtin's decision tree is more than the saved call overhead.  That just means you have to go and collect data to avoid problems like this.

And you forgot case 5: high probability of zero.  That could make an fpclassify benchmark misleading when you just use a zeroed array, like I did with my original benchmark.
> -----Original Message-----
> From: Joseph Myers [mailto:joseph@codesourcery.com]
> Sent: 15 June 2015 18:01
> To: Wilco Dijkstra
> Cc: GNU C Library
> Subject: RE: [PATCH] Inline C99 math functions
>
> On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
>
> > > Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.
> >
> > I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...
>
> Well, the benchmark should come first....

I added a new math-inlines benchmark based on the string benchmark infrastructure.  I used 2x1024 inputs: one set with 99% finite FP numbers (20% zeroes) and 1% inf/NaN, and a second set with 50% inf and 50% NaN.  Here are the relative timings for Cortex-A57:

__fpclassify_t:     8.76   7.04
fpclassify_t:       4.91   5.17

__isnormal_inl_t:   8.77   7.16
isnormal_t:         3.16   3.17

__finite_inl_t:     1.91   1.91
__finite_t:        15.29  15.28
isfinite_t:         1.28   1.28

__isinf_inl_t:      1.92   2.99
__isinf_t:          8.9    6.17
isinf_t:            1.28   1.28

__isnan_inl_t:      1.91   1.92
__isnan_t:         15.28  15.28
isnan_t:            1      1.01

The plain isnan_t functions use the GCC built-ins, the _inl variants use the existing math_private.h inlines (with __isinf fixed to return the sign too), and the __isnan variants are the non-inline GLIBC functions.

So this clearly shows the GCC built-ins win by a huge margin, including over the inline versions.  It also shows that multiple isinf/isnan calls would be faster than a single inlined fpclassify...

> > > What are the effects on code size (especially for fpclassify)?  If code becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be considered.
> >
> > Codesize of what?  Few applications use these functions... GLIBC mathlib is
>
> Size of any code calling these macros (for nonconstant arguments).

Well, the size of the __isinf_t function is 160 bytes vs 84 bytes for isinf_t, due to the callee-save overhead of the function call.  The builtin isinf uses 3 instructions inside the loop plus 3 lifted before it, while the call to __isinf needs 3 plus a lot of code to save/restore the callee-saves.

> > Also I don't think having special inlines that are only used inside GLIBC is a good approach - if the GCC built-ins are not fast enough then we should fix them.
>
> Yes, we should improve the built-in functions, but first we should understand the effects on performance of glibc libm functions (I don't know if the existing benchtests cover cases where the __finite / __isnan / __isinf_ns inlines were used) to see if this cleanup patch indeed doesn't significantly harm performance of affected libm functions (and possibly improves performance through the changes in cases that wouldn't previously have been inlined at all).

A run of the math tests doesn't show up any obvious differences beyond the usual variation from run to run.  I suspect the difference due to inlining is in the noise for expensive math functions.

Wilco
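For readers without the (not yet posted) patch, the general shape of such a timing loop is roughly the following plain-C sketch; the real math-inlines benchmark uses the benchtests/string infrastructure and its own timing macros, so names, sizes and ratios here are purely illustrative:

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

    #define N 1024
    static double inputs[N];

    int
    main (void)
    {
      volatile int sink = 0;          /* keep the result live */
      struct timespec t0, t1;

      for (int i = 0; i < N; i++)
        inputs[i] = (i % 100 == 0) ? INFINITY : 1.0 + i;   /* ~1% inf */

      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int iter = 0; iter < 100000; iter++)
        for (int i = 0; i < N; i++)
          sink += isinf (inputs[i]);
      clock_gettime (CLOCK_MONOTONIC, &t1);

      printf ("%.3f ns/call\n",
              ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
              / (100000.0 * N));
      return sink;
    }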
On Tue, 16 Jun 2015, Wilco Dijkstra wrote:

> > Well, the benchmark should come first....
>
> I added a new math-inlines benchmark based on the string benchmark infrastructure.

Thanks.  I await the patch submission.

> So this clearly shows the GCC built-ins win by a huge margin, including over the inline versions.  It also shows that multiple isinf/isnan calls would be faster

That's interesting information - suggesting that changes in GCC to use integer arithmetic should be conditional on -fsignaling-nans, if doing the operations by integer arithmetic is slower (at least on this processor).

(It also suggests it's safe to remove the existing glibc-internal inlines as part of moving to using the built-in functions when possible.)

> > > Codesize of what?  Few applications use these functions... GLIBC mathlib is
> >
> > Size of any code calling these macros (for nonconstant arguments).
>
> Well, the size of the __isinf_t function is 160 bytes vs 84 bytes for isinf_t, due to the callee-save overhead of the function call.  The builtin isinf uses 3 instructions inside the loop plus 3 lifted before it, while the call to __isinf needs 3 plus a lot of code to save/restore the callee-saves.

One might suppose that most functions using these macros contain other function calls as well, and so that the callee-save overhead should not be included in the comparison.

When you exclude callee-save overhead, how do things compare for fpclassify (the main case where inlining may be questionable when optimizing for size)?
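The __OPTIMIZE_SIZE__ gating Joseph keeps referring to would look roughly like the following header fragment; this is an illustrative sketch of the idea, not the actual glibc math.h definition:

    /* Only expand fpclassify inline when not optimizing for size;
       otherwise fall back to the out-of-line classification functions.  */
    #if defined __GNUC__ && !defined __OPTIMIZE_SIZE__
    # define fpclassify(x) __builtin_fpclassify (FP_NAN, FP_INFINITE,      \
                                                 FP_NORMAL, FP_SUBNORMAL,  \
                                                 FP_ZERO, x)
    #else
    # define fpclassify(x)                                                 \
      (sizeof (x) == sizeof (float) ? __fpclassifyf (x) : __fpclassify (x))
    #endif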
On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
>
> > -----Original Message-----
> > From: Joseph Myers [mailto:joseph@codesourcery.com]
> > Sent: 15 June 2015 18:01
> > To: Wilco Dijkstra
> > Cc: GNU C Library
> > Subject: RE: [PATCH] Inline C99 math functions
> >
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> >
> > > > Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.
> > >
> > > I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...
> >
> > Well, the benchmark should come first....
>
> I added a new math-inlines benchmark based on the string benchmark infrastructure.  I used 2x1024 inputs: one set with 99% finite FP numbers (20% zeroes) and 1% inf/NaN, and a second set with 50% inf and 50% NaN.  Here are the relative timings for Cortex-A57:
>
Where is the benchmark?  There are several things that could go wrong with it.

> __fpclassify_t:     8.76   7.04
> fpclassify_t:       4.91   5.17
>
> __isnormal_inl_t:   8.77   7.16
> isnormal_t:         3.16   3.17

Where did you get the inline?  I couldn't find it anywhere.  Also such a big number for an inline implementation is suspect.

> __finite_inl_t:     1.91   1.91
> __finite_t:        15.29  15.28
> isfinite_t:         1.28   1.28
>
> __isinf_inl_t:      1.92   2.99
> __isinf_t:          8.9    6.17
> isinf_t:            1.28   1.28
>
> __isnan_inl_t:      1.91   1.92
> __isnan_t:         15.28  15.28
> isnan_t:            1      1.01
>
> The plain isnan_t functions use the GCC built-ins, the _inl variants use the existing math_private.h inlines (with __isinf fixed to return the sign too), and the __isnan variants are the non-inline GLIBC functions.
>
> So this clearly shows the GCC built-ins win by a huge margin, including over the inline versions.

That looks a bit suspect; submit the benchmark so we can see whether it's correct or not.

> It also shows that multiple isinf/isnan calls would be faster than a single inlined fpclassify...
>
No, that's completely wrong.  When you look at the assembly when using __builtin_isnan, it is identical to (on x64, but I doubt arm gcc is worse):

    __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x) == FP_NAN

So removing fpclassify would, in the better case, not change performance at all, and in the worse case it would harm it due to duplicated checks.

> > > > What are the effects on code size (especially for fpclassify)?  If code becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be considered.
> > >
> > > Codesize of what?  Few applications use these functions... GLIBC mathlib is
> >
> > Size of any code calling these macros (for nonconstant arguments).
>
> Well, the size of the __isinf_t function is 160 bytes vs 84 bytes for isinf_t, due to the callee-save overhead of the function call.  The builtin isinf uses 3 instructions inside the loop plus 3 lifted before it, while the call to __isinf needs 3 plus a lot of code to save/restore the callee-saves.
>
> > > Also I don't think having special inlines that are only used inside GLIBC is a good approach - if the GCC built-ins are not fast enough then we should fix them.
> >
> > Yes, we should improve the built-in functions, but first we should understand the effects on performance of glibc libm functions (I don't know if the existing benchtests cover cases where the __finite / __isnan / __isinf_ns inlines were used) to see if this cleanup patch indeed doesn't significantly harm performance of affected libm functions (and possibly improves performance through the changes in cases that wouldn't previously have been inlined at all).
>
> A run of the math tests doesn't show up any obvious differences beyond the usual variation from run to run.  I suspect the difference due to inlining is in the noise for expensive math functions.
>
Look at complex math - those functions use these checks.  For real math you need to pick specific inputs that trigger the unlikely paths that use isinf...
> Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > I added a new math-inlines benchmark based on the string benchmark infrastructure.  I used 2x1024 inputs: one set with 99% finite FP numbers (20% zeroes) and 1% inf/NaN, and a second set with 50% inf and 50% NaN.  Here are the relative timings for Cortex-A57:
> >
> Where is the benchmark?  There are several things that could go wrong with it.

I'll send it when I can (it has to go through review etc).

> > __fpclassify_t:     8.76   7.04
> > fpclassify_t:       4.91   5.17
> >
> > __isnormal_inl_t:   8.77   7.16
> > isnormal_t:         3.16   3.17
>
> Where did you get the inline?  I couldn't find it anywhere.  Also such a big number for an inline implementation is suspect.

It does (__fpclassify (x) == FP_NORMAL) like math.h, which is obviously a bad idea and the reason for the low performance.  Although the GCC isnormal builtin is not particularly fast, it still beats that by more than a factor of 2.

> > __finite_inl_t:     1.91   1.91
> > __finite_t:        15.29  15.28
> > isfinite_t:         1.28   1.28
> >
> > __isinf_inl_t:      1.92   2.99
> > __isinf_t:          8.9    6.17
> > isinf_t:            1.28   1.28
> >
> > __isnan_inl_t:      1.91   1.92
> > __isnan_t:         15.28  15.28
> > isnan_t:            1      1.01
> >
> > The plain isnan_t functions use the GCC built-ins, the _inl variants use the existing math_private.h inlines (with __isinf fixed to return the sign too), and the __isnan variants are the non-inline GLIBC functions.
> >
> > So this clearly shows the GCC built-ins win by a huge margin, including over the inline versions.
>
> That looks a bit suspect; submit the benchmark so we can see whether it's correct or not.

It's certainly correct, but obviously different microarchitectures will show different results.  Note the GLIBC private inlines are not particularly good.

> > It also shows that multiple isinf/isnan calls would be faster than a single inlined fpclassify...
> >
> No, that's completely wrong.  When you look at the assembly when using __builtin_isnan, it is identical to (on x64, but I doubt arm gcc is worse):
>
>     __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x) == FP_NAN
>
> So removing fpclassify would, in the better case, not change performance at all, and in the worse case it would harm it due to duplicated checks.

fpclassify basically does several checks if you save the result in a variable, and then executes some branches.  So you are far better off using dedicated checks if you just need 2 or 3 of the 5 possible results.  And depending on how the code is structured you may only ever execute 1 check.  That is far cheaper than first computing the full result for fpclassify and then testing that.

> > A run of the math tests doesn't show up any obvious differences beyond the usual variation from run to run.  I suspect the difference due to inlining is in the noise for expensive math functions.
> >
> Look at complex math - those functions use these checks.  For real math you need to pick specific inputs that trigger the unlikely paths that use isinf...

Yes, this needs a dedicated test.  Still, if we save a cycle in a 100-cycle function, it is hard to show it given that modern OoO CPUs can have 5% variation from run to run...

Wilco
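To make the restructuring Wilco describes concrete, here is a hypothetical caller showing the two shapes being compared; the functions are made up for illustration and are not taken from glibc:

    #include <math.h>

    /* With fpclassify, the full classification is computed (and kept in a
       variable) before any branch is taken.  */
    int
    check_with_fpclassify (double x)
    {
      int c = fpclassify (x);
      if (c == FP_NAN)
        return 1;
      if (c == FP_INFINITE)
        return 2;
      return 0;
    }

    /* With dedicated predicates, the common path may execute only the first
       (usually not-taken) check, and the classes the caller never asks
       about are simply never tested.  */
    int
    check_with_predicates (double x)
    {
      if (isnan (x))
        return 1;
      if (isinf (x))
        return 2;
      return 0;
    }

Ondřej's counterpoint below is that when the fpclassify result feeds directly into such comparisons, GCC can fold away the unused classes, so the two forms need not differ.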
On 06/15/2015 01:00 PM, Joseph Myers wrote:
> On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
>
>>> Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.
>>
>> I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...
>
> Well, the benchmark should come first....

Agreed.

If it's not trivial to test the performance... then how did you test it?

How do I test it when reviewing the patch?

Cheers,
Carlos.
On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
>> So, to move this proposal forward, how exactly do you propose to create a benchtest for such a scenario?  I get that this is tricky and a lot of variables may apply, but I do agree with Joseph that we shouldn't necessarily aim for optimal performance; imho using compiler builtins with reasonable performance is a gain in code maintainability.
>>
> As I said before, these are hard to measure, and I could argue against my own benchmark too - it's inaccurate because it doesn't measure the effect of the cpu pipeline when the function does other computation.  The answer is: don't do a microbenchmark.

That's not an answer.  An answer is "Here's a patch to extend the libm testing to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."

I agree that microbenchmarks can be misleading if interpreted by automated systems, but we aren't talking about that yet; we are talking about experts using these tools to discuss patches in an objective fashion.

Cheers,
Carlos.
On Tue, Jun 16, 2015 at 06:53:39PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > > I added a new math-inlines benchmark based on the string benchmark infrastructure.  I used 2x1024 inputs: one set with 99% finite FP numbers (20% zeroes) and 1% inf/NaN, and a second set with 50% inf and 50% NaN.  Here are the relative timings for Cortex-A57:
> > >
> > Where is the benchmark?  There are several things that could go wrong with it.
>
> I'll send it when I can (it has to go through review etc).
>
> > > __fpclassify_t:     8.76   7.04
> > > fpclassify_t:       4.91   5.17
> > >
> > > __isnormal_inl_t:   8.77   7.16
> > > isnormal_t:         3.16   3.17
> >
> > Where did you get the inline?  I couldn't find it anywhere.  Also such a big number for an inline implementation is suspect.
>
> It does (__fpclassify (x) == FP_NORMAL) like math.h, which is obviously a bad idea and the reason for the low performance.  Although the GCC isnormal builtin is not particularly fast, it still beats that by more than a factor of 2.
>
No, the bad idea was not inlining fpclassify - that accounts for most of the performance difference.  There is also the problem that sysdeps/ieee754/dbl-64/s_fpclassify.c is a bit slow, as it tests the unlikely cases first, but that is secondary.

> > > __finite_inl_t:     1.91   1.91
> > > __finite_t:        15.29  15.28
> > > isfinite_t:         1.28   1.28
> > >
> > > __isinf_inl_t:      1.92   2.99
> > > __isinf_t:          8.9    6.17
> > > isinf_t:            1.28   1.28
> > >
> > > __isnan_inl_t:      1.91   1.92
> > > __isnan_t:         15.28  15.28
> > > isnan_t:            1      1.01
> > >
> > > The plain isnan_t functions use the GCC built-ins, the _inl variants use the existing math_private.h inlines (with __isinf fixed to return the sign too), and the __isnan variants are the non-inline GLIBC functions.
> > >
> > > So this clearly shows the GCC built-ins win by a huge margin, including over the inline versions.
> >
> > That looks a bit suspect; submit the benchmark so we can see whether it's correct or not.
>
> It's certainly correct, but obviously different microarchitectures will show different results.  Note the GLIBC private inlines are not particularly good.
>
No, the problem is that different benchmarks show different results on the same architecture.  To speed things up, run the following to test all the environment cases; run the attached tf script to get results on arm.

My concern about the libc inlines is that they need two big constants.  That shouldn't be a problem inside a loop, but it is in initial checks, so I need the benchmark to see which case you tested.  Apart from that, the only explanation I could come up with is that a move from an fp register has high latency compared to a floating-point comparison.  That doesn't have to mean anything on an OoO cpu, which hides that latency by executing the instructions after it.  A second theory is that gcc decided that infinity is likely, which leads to suboptimal assembly.  We should also add expect hints to the header to provide that information.

On x64 I get the following results depending on how gcc optimizes the code, which is controlled by inlining:

don't inline
conditional add
branched
real    0m1.313s
user    0m1.312s
sys     0m0.000s
builtin
real    0m1.309s
user    0m1.308s
sys     0m0.000s
branch
branched
real    0m1.310s
user    0m1.308s
sys     0m0.000s
builtin
real    0m1.337s
user    0m1.312s
sys     0m0.004s
sum
branched
real    0m1.209s
user    0m1.204s
sys     0m0.000s
builtin
real    0m1.216s
user    0m1.212s
sys     0m0.000s

inline outer call
conditional add
branched
real    0m0.705s
user    0m0.704s
sys     0m0.000s
builtin
real    0m0.916s
user    0m0.916s
sys     0m0.000s
branch
branched
real    0m0.806s
user    0m0.804s
sys     0m0.000s
builtin
real    0m0.721s
user    0m0.716s
sys     0m0.000s
sum
branched
real    0m1.029s
user    0m1.028s
sys     0m0.000s
builtin
real    0m0.911s
user    0m0.908s
sys     0m0.000s

inline inner call
conditional add
branched
real    0m1.038s
user    0m1.032s
sys     0m0.000s
builtin
real    0m1.024s
user    0m1.016s
sys     0m0.000s
branch
branched
real    0m0.614s
user    0m0.608s
sys     0m0.000s
builtin
real    0m0.606s
user    0m0.608s
sys     0m0.000s
sum
branched
real    0m0.662s
user    0m0.660s
sys     0m0.000s
builtin
real    0m0.629s
user    0m0.628s
sys     0m0.000s

tight loop
conditional add
branched
real    0m0.208s
user    0m0.208s
sys     0m0.000s
builtin
real    0m0.326s
user    0m0.324s
sys     0m0.000s
branch
branched
real    0m0.204s
user    0m0.200s
sys     0m0.000s
builtin
real    0m0.325s
user    0m0.324s
sys     0m0.000s
sum
branched
real    0m0.328s
user    0m0.332s
sys     0m0.000s
builtin
real    0m0.486s
user    0m0.484s
sys     0m0.000s

> > > It also shows that multiple isinf/isnan calls would be faster than a single inlined fpclassify...
> > >
> > No, that's completely wrong.  When you look at the assembly when using __builtin_isnan, it is identical to (on x64, but I doubt arm gcc is worse):
> >
> >     __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x) == FP_NAN
> >
> > So removing fpclassify would, in the better case, not change performance at all, and in the worse case it would harm it due to duplicated checks.
>
> fpclassify basically does several checks if you save the result in a variable, and then executes some branches.  So you are far better off using dedicated checks if you just need 2 or 3 of the 5 possible results.  And depending on how the code is structured you may only ever execute 1 check.  That is far cheaper than first computing the full result for fpclassify and then testing that.
>
Which doesn't matter.  As gcc optimizes the unneeded checks away, you won't do unneeded checks.  You are claiming that using

    int result = __builtin_expect (__builtin_fpclassify (FP_NAN, FP_INFINITE,
                                     FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x), 0);
    return result == FP_INFINITE || result == FP_NAN;

is slower than

    return __builtin_isinf (x) || __builtin_isnan (x);

Your claim is false; run the attached tf2 script to test:

conditional add
isnan+isinf
real    0m0.141s
user    0m0.140s
sys     0m0.000s
classify
real    0m0.136s
user    0m0.132s
sys     0m0.000s
branch
isnan+isinf
real    0m0.158s
user    0m0.156s
sys     0m0.000s
classify
real    0m0.136s
user    0m0.132s
sys     0m0.000s
sum
isnan+isinf
real    0m0.237s
user    0m0.236s
sys     0m0.000s
classify
real    0m0.202s
user    0m0.200s
sys     0m0.000s

> > > A run of the math tests doesn't show up any obvious differences beyond the usual variation from run to run.  I suspect the difference due to inlining is in the noise for expensive math functions.
> > >
> > Look at complex math - those functions use these checks.  For real math you need to pick specific inputs that trigger the unlikely paths that use isinf...
>
> Yes, this needs a dedicated test.  Still, if we save a cycle in a 100-cycle function, it is hard to show it given that modern OoO CPUs can have 5% variation from run to run...
>
It isn't hard - you just need to run it long enough that the variance drops below one cycle.  Also, with OoO it is even more important: a function could be slower on the benchmark yet faster in practice, because it takes advantage of OoO and selects instructions that are easy to execute in parallel.

ft.c:

#include <stdlib.h>
#include <stdio.h>
//#include <math.h>

#undef isinf2
#ifdef BRANCHED
int I2 isinf2 (double dx)
{
  union u { double d; long l; };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 : -1) : 0;
}
#else
int I2 isinf2 (double x)
{
  return __builtin_isinf (x);
}
#endif

void I1 t1 (double x, double *d) { if (isinf2 (x)) *d += 42; }
void I1 t2 (double x, double *d) { if (isinf2 (x)) abort (); }
void I1 t3 (double x, double *d) { *d += isinf2 (x); }

int main ()
{
  double ret = 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i = 0; i < 1000; i++)
    d[i] = 4.2; //(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);
  for (j = 0; j < 300000; j++)
#ifdef T1
    for (i = 0; i < 1000; i++)
      t1 (d[i], &ret);
#endif
#ifdef T2
    for (i = 0; i < 1000; i++)
      t2 (d[i], &ret);
#endif
#ifdef T3
    for (i = 0; i < 1000; i++)
      t3 (d[i], &ret);
#endif
  return ret;
}

ft2.c:

#include <stdlib.h>
#include <stdio.h>

#if defined CLAS
#include <math.h>
#undef isinf2
static int isinf2 (double x)
{
  int result = __builtin_expect (__builtin_fpclassify (FP_NAN, FP_INFINITE, \
    FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x), 0);
  return result == FP_INFINITE || result == FP_NAN;
}
#else
static int isinf2 (double x)
{
  return __builtin_isinf (x) || __builtin_isnan (x);
}
#endif

int main ()
{
  double ret = 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i = 0; i < 1000; i++)
    d[i] = 4.2; //(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);
  for (j = 0; j < 100000; j++)
#ifdef T1
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf2 (d[i]), 0))
        ret += 42;
#endif
#ifdef T2
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf2 (d[i]), 0))
        abort ();
#endif
#ifdef T3
    for (i = 0; i < 1000; i++)
      ret += __builtin_expect (isinf2 (d[i]), 0);
#endif
  return ret;
}

tf:

test() {
  gcc -O3 -DBRANCHED -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
  echo branched
  time ./a.out
  gcc -O3 -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
  echo builtin
  time ./a.out
}

test2() {
  echo conditional add
  test $1 $2 1
  echo branch
  test $1 $2 2
  echo sum
  test $1 $2 3
}

echo "don't inline"
test2 '__attribute__((noinline))' '__attribute__((noinline))'
echo inline outer call
test2 '__attribute__((noinline))' '__attribute__((always_inline))'
echo inline inner call
test2 '__attribute__((always_inline))' '__attribute__((noinline))'
echo tight loop
test2 '__attribute__((always_inline))' '__attribute__((always_inline))'

tf2:

echo conditional add
gcc -O3 ft2.c -DT1
echo isnan+isinf
time ./a.out
gcc -O3 ft2.c -DT1 -DCLAS
echo classify
time ./a.out
echo branch
gcc -O3 ft2.c -DT2
echo isnan+isinf
time ./a.out
gcc -O3 ft2.c -DT2 -DCLAS
echo classify
time ./a.out
echo sum
gcc -O3 ft2.c -DT3
echo isnan+isinf
time ./a.out
gcc -O3 ft2.c -DT3 -DCLAS
echo classify
time ./a.out
> Carlos O'Donell wrote:
> On 06/15/2015 01:00 PM, Joseph Myers wrote:
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> >
> >>> Where are the benchmarks for this?  Please put them in benchtests so actual reproducible figures can be given.  That's the standard practice for any change being justified on the basis of performance.
> >>
> >> I'll add a benchmark in another patch - it's not trivial, as benchtests is not suitable for accurately timing very simple functions, especially when inlined...
> >
> > Well, the benchmark should come first....
>
> Agreed.
>
> If it's not trivial to test the performance... then how did you test it?
>
> How do I test it when reviewing the patch?

Besides checking that it was inlining as expected, I ran the usual benchmarks like SPECFP to check that there are no performance regressions due to the math libs.  Adding more inlining is simply a no-brainer, especially when we're talking about 3-4 instruction functions and avoiding not just a call but also a PLT indirection...  This provides more than a factor of 15 speedup in my microbenchmarks.

Wilco
> Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 06:53:39PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > > > I added a new math-inlines benchmark based on the string benchmark infrastructure.  I used 2x1024 inputs: one set with 99% finite FP numbers (20% zeroes) and 1% inf/NaN, and a second set with 50% inf and 50% NaN.  Here are the relative timings for Cortex-A57:
> > > >
> > > Where is the benchmark?  There are several things that could go wrong with it.
> >
> > I'll send it when I can (it has to go through review etc).
> >
> > > > __fpclassify_t:     8.76   7.04
> > > > fpclassify_t:       4.91   5.17
> > > >
> > > > __isnormal_inl_t:   8.77   7.16
> > > > isnormal_t:         3.16   3.17
> > >
> > > Where did you get the inline?  I couldn't find it anywhere.  Also such a big number for an inline implementation is suspect.
> >
> > It does (__fpclassify (x) == FP_NORMAL) like math.h, which is obviously a bad idea and the reason for the low performance.  Although the GCC isnormal builtin is not particularly fast, it still beats that by more than a factor of 2.
> >
> No, the bad idea was not inlining fpclassify - that accounts for most of the performance difference.  There is also the problem that sysdeps/ieee754/dbl-64/s_fpclassify.c is a bit slow, as it tests the unlikely cases first, but that is secondary.

Even with the inlined fpclassify (inl2 below), isnormal is slower:

__isnormal_inl2_t:  1.25   3.67
__isnormal_inl_t:   4.59   2.89
isnormal_t:         1      1

So using a dedicated builtin for isnormal is important.

> > > > It's certainly correct, but obviously different microarchitectures will show different results.  Note the GLIBC private inlines are not particularly good.
> > > >
> > > No, the problem is that different benchmarks show different results on the same architecture.  To speed things up, run the following to test all the environment cases; run the attached tf script to get results on arm.

I tried, but I don't think this is a good benchmark - you're not measuring the FP->int move for the branched version, and you're comparing the signed version of isinf against the builtin, which does the __isinf_ns (no-sign) check.

> Which doesn't matter.  As gcc optimizes the unneeded checks away, you won't do unneeded checks.  You are claiming that using
>
>     int result = __builtin_expect (__builtin_fpclassify (FP_NAN, FP_INFINITE,
>                                      FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x), 0);
>     return result == FP_INFINITE || result == FP_NAN;
>
> is slower than
>
>     return __builtin_isinf (x) || __builtin_isnan (x);
>
> Your claim is false; run the attached tf2 script to test.

That's not what I am seeing: using two explicit isinf/isnan calls (test2) is faster than the inlined fpclassify (test1):

__fpclassify_test2_t:  1     4.41
__fpclassify_test1_t:  1.23  4.66

Wilco
On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > > > > __fpclassify_t:     8.76   7.04
> > > > > fpclassify_t:       4.91   5.17
> > > > >
> > > > > __isnormal_inl_t:   8.77   7.16
> > > > > isnormal_t:         3.16   3.17
> > > >
> > > > Where did you get the inline?  I couldn't find it anywhere.  Also such a big number for an inline implementation is suspect.
> > >
> > > It does (__fpclassify (x) == FP_NORMAL) like math.h, which is obviously a bad idea and the reason for the low performance.  Although the GCC isnormal builtin is not particularly fast, it still beats that by more than a factor of 2.
> > >
> > No, the bad idea was not inlining fpclassify - that accounts for most of the performance difference.  There is also the problem that sysdeps/ieee754/dbl-64/s_fpclassify.c is a bit slow, as it tests the unlikely cases first, but that is secondary.
>
> Even with the inlined fpclassify (inl2 below), isnormal is slower:
>
> __isnormal_inl2_t:  1.25   3.67
> __isnormal_inl_t:   4.59   2.89
> isnormal_t:         1      1
>
> So using a dedicated builtin for isnormal is important.
>
That makes the result identical to the one for isnan.  That it is slower is a bug in fpclassify, which should first check for normal and only then do the unlikely checks.

> > > It's certainly correct, but obviously different microarchitectures will show different results.  Note the GLIBC private inlines are not particularly good.
> > >
> > No, the problem is that different benchmarks show different results on the same architecture.  To speed things up, run the following to test all the environment cases; run the attached tf script to get results on arm.
>
> I tried, but I don't think this is a good benchmark - you're not measuring the FP->int move for the branched version, and you're comparing the signed version of isinf against the builtin, which does the __isinf_ns (no-sign) check.
>
Wilco, that isinf is signed is again completely irrelevant.  gcc is smart: the check gets expanded into

    if (foo ? (bar ? 1 : -1) : 0)

which gcc simplifies to

    if (foo)

so it doesn't matter whether checking the sign would take 100 cycles, as it is deleted code.  Also, since the sign check could only make the branched version slower, if it still beats the builtin then the unsigned one would beat the builtin too.

And do you have assembly to show it doesn't measure the move, or is that just your guess?  On x64 it definitely measures the move, and I could add that gcc messes that bit up by moving several times.  objdump -d on

    gcc -DT2 -DBRANCHED -DI1="__attribute((always_inline))" -DI2="__attribute__((always_inline))" ft.c -c

clearly shows that the conversion is done several times:

 24c:	48 8b 45 f0          	mov    -0x10(%rbp),%rax
 250:	48 01 d0             	add    %rdx,%rax
 253:	f2 0f 10 00          	movsd  (%rax),%xmm0
 257:	f2 0f 11 45 e8       	movsd  %xmm0,-0x18(%rbp)
 25c:	48 8d 45 c8          	lea    -0x38(%rbp),%rax
 260:	48 89 45 e0          	mov    %rax,-0x20(%rbp)
 264:	f2 0f 10 45 e8       	movsd  -0x18(%rbp),%xmm0
 269:	f2 0f 11 45 d8       	movsd  %xmm0,-0x28(%rbp)
 26e:	f2 0f 10 45 d8       	movsd  -0x28(%rbp),%xmm0
 273:	f2 0f 11 45 c0       	movsd  %xmm0,-0x40(%rbp)
 278:	48 8b 45 c0          	mov    -0x40(%rbp),%rax
 27c:	48 89 45 d0          	mov    %rax,-0x30(%rbp)
 280:	48 8b 45 d0          	mov    -0x30(%rbp),%rax
 284:	48 8d 14 00          	lea    (%rax,%rax,1),%rdx
 288:	48 b8 00 00 00 00 00 	movabs $0xffe0000000000000,%rax
 28f:	00 e0 ff
 292:	48 39 c2             	cmp    %rax,%rdx
 295:	75 1e                	jne    2b5 <main+0xe3>
 297:	48 b8 00 00 00 00 00 	movabs $0x7ff0000000000000,%rax
 29e:	00 f0 7f
 2a1:	48 39 45 d0          	cmp    %rax,-0x30(%rbp)
 2a5:	75 07                	jne    2ae <main+0xdc>
 2a7:	b8 01 00 00 00       	mov    $0x1,%eax
 2ac:	eb 0c                	jmp    2ba <main+0xe8>
 2ae:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
 2b3:	eb 05                	jmp    2ba <main+0xe8>
 2b5:	b8 00 00 00 00       	mov    $0x0,%eax
 2ba:	85 c0                	test   %eax,%eax
 2bc:	74 05                	je     2c3 <main+0xf1>

So will you publish the results or not, or would that show your builtins in an unfavorable light?

> > Which doesn't matter.  As gcc optimizes the unneeded checks away, you won't do unneeded checks.  You are claiming that using
> >
> >     int result = __builtin_expect (__builtin_fpclassify (FP_NAN, FP_INFINITE,
> >                                      FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x), 0);
> >     return result == FP_INFINITE || result == FP_NAN;
> >
> > is slower than
> >
> >     return __builtin_isinf (x) || __builtin_isnan (x);
> >
> > Your claim is false; run the attached tf2 script to test.
>
> That's not what I am seeing: using two explicit isinf/isnan calls (test2) is faster than the inlined fpclassify (test1):
>
> __fpclassify_test2_t:  1     4.41
> __fpclassify_test1_t:  1.23  4.66
>
You need to run my benchmark.  You get different results depending on whether the check sits inside an if or not, so I also want to see the results from my benchmark.
On Wed, Jun 17, 2015 at 06:34:33PM +0200, Ondřej Bílka wrote:
> > I tried, but I don't think this is a good benchmark - you're not measuring the FP->int move for the branched version, and you're comparing the signed version of isinf against the builtin, which does the __isinf_ns (no-sign) check.
> >
> Wilco, that isinf is signed is again completely irrelevant.  gcc is smart: the check gets expanded into
>
>     if (foo ? (bar ? 1 : -1) : 0)
>
> which gcc simplifies to
>
>     if (foo)
>
> so it doesn't matter whether checking the sign would take 100 cycles, as it is deleted code.
>
Forgot to add that you should use gcc-4.9 or higher.  I know that 4.7 doesn't do this simplification, and I don't remember whether 4.8 does.
> Joseph Myers wrote:
> On Tue, 16 Jun 2015, Wilco Dijkstra wrote:
>
> > > Well, the benchmark should come first....
> >
> > I added a new math-inlines benchmark based on the string benchmark infrastructure.
>
> Thanks.  I await the patch submission.

See https://sourceware.org/ml/libc-alpha/2015-06/msg00569.html

> > So this clearly shows the GCC built-ins win by a huge margin, including over the inline versions.  It also shows that multiple isinf/isnan calls would be faster
>
> That's interesting information - suggesting that changes in GCC to use integer arithmetic should be conditional on -fsignaling-nans, if doing the operations by integer arithmetic is slower (at least on this processor).
>
> (It also suggests it's safe to remove the existing glibc-internal inlines as part of moving to using the built-in functions when possible.)

Indeed.  To check which sequence is better we'd need to write a better benchmark, maybe basing it on a GLIBC function which uses these functions in the hot path.

> > > > Codesize of what?  Few applications use these functions... GLIBC mathlib is
> > >
> > > Size of any code calling these macros (for nonconstant arguments).
> >
> > Well, the size of the __isinf_t function is 160 bytes vs 84 bytes for isinf_t, due to the callee-save overhead of the function call.  The builtin isinf uses 3 instructions inside the loop plus 3 lifted before it, while the call to __isinf needs 3 plus a lot of code to save/restore the callee-saves.
>
> One might suppose that most functions using these macros contain other function calls as well, and so that the callee-save overhead should not be included in the comparison.

That may be true in some cases, but if you can tail-call (which might be possible in several math veneers) then the callee-save savings would still apply.

> When you exclude callee-save overhead, how do things compare for fpclassify (the main case where inlining may be questionable when optimizing for size)?

Well, in the worst-case scenario where you need all 5 classes from fpclassify, it effectively changes a single call instruction into 16 instructions plus 2 double immediates.  So it is best to use __OPTIMIZE_SIZE__ gating for fpclassify for now and revisit when the GCC implementation has been improved.  I also wonder what the difference would be once I've optimized the __fpclassify implementation - I can do it in about 8-9 instructions.

Wilco
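For orientation, a compact integer-arithmetic __fpclassify has roughly the following shape; this is an illustrative sketch of the general technique, not the implementation Wilco refers to, and the function name is made up:

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Classify a double from its biased exponent and mantissa bits,
       testing the common FP_NORMAL case first.  */
    int
    fpclassify_sketch (double x)
    {
      uint64_t u;
      memcpy (&u, &x, sizeof u);
      uint64_t exp = (u >> 52) & 0x7ff;
      uint64_t mant = u & 0xfffffffffffffULL;

      if (exp != 0 && exp != 0x7ff)
        return FP_NORMAL;                            /* common case */
      if (exp == 0)
        return mant == 0 ? FP_ZERO : FP_SUBNORMAL;
      return mant == 0 ? FP_INFINITE : FP_NAN;       /* exp all-ones */
    }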
> Ondřej Bílka wrote:
> On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > Even with the inlined fpclassify (inl2 below), isnormal is slower:
> >
> > __isnormal_inl2_t:	1.25	3.67
> > __isnormal_inl_t:	4.59	2.89
> > isnormal_t:		1	1
> >
> > So using a dedicated builtin for isnormal is important.
> >
> That makes the result identical to the one for isnan. That it is slower is a
> bug in fpclassify, which should first check for normal, then do the unlikely
> checks.

It's about twice as slow as isnan because the isnormal check isn't done
efficiently. Fpclassify is slower still as it does 3 comparisons before
setting FP_NORMAL.

> > > > It's certainly correct, but obviously different microarchitectures will show
> > > > different results. Note the GLIBC private inlines are not particularly good.
> > > >
> > > No, the problem is that different benchmarks show different results on the
> > > same architecture. To speed things up run the following to test all cases of
> > > the environment. Run the attached tf script to get results on arm.
> >
> > I tried, but I don't think this is a good benchmark - you're not measuring
> > the FP->int move for the branched version, and you're comparing the signed
> > version of isinf vs the builtin which does isinf_ns.
> >
> Wilco, that isinf is signed is again completely irrelevant. gcc is smart.
> It gets expanded into
>   if (foo ? (bar ? 1 : -1) : 0)
> which gcc simplifies to
>   if (foo)
> so it doesn't matter that checking the sign would take 100 cycles, as it's
> deleted code.

It matters for the -DT3 test.

> Also, as it could only make the branched version slower, when it beats the
> builtin then the nonsigned one would also beat the builtin.
>
> And do you have assembly to show it doesn't measure the move, or is that just
> your guess? On x64 it definitely measures the move, and I could add that gcc
> messes that bit up by moving several times. objdump -d on
> gcc -DT2 -DBRANCHED -DI1="__attribute((always_inline))" -DI2="__attribute__((always_inline))" ft.c -c
> clearly shows that the conversion is done several times.

Look at what the inner loop generates for T1 (T2/T3 do the same):

.L27:
	ldr	x2, [x1]
	cmp	x4, x2, lsl 1
	bne	.L29
	fadd	d0, d0, d1
.L29:
	add	x1, x1, 8
	cmp	x1, x3
	bne	.L27

> So will you publish results or not, as they would show your builtins in an
> unfavorable light?

I don't see the point until your benchmark measures the right thing.
Note my benchmark carefully avoids this gotcha.

Wilco
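
As an illustration of what a dedicated isnormal test can look like without
going through fpclassify, here is a minimal integer-arithmetic sketch
(assuming IEEE binary64; this is neither the glibc implementation nor what
the builtin emits):

#include <stdint.h>
#include <string.h>

/* A double is normal iff its biased exponent field is neither all-zero
   (zero/subnormal) nor all-ones (Inf/NaN).  */
static inline int
isnormal_bits (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);          /* well-defined type pun */
  uint64_t exp = (u >> 52) & 0x7ff;   /* biased exponent field */
  return exp != 0 && exp != 0x7ff;
}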
On Wed, Jun 17, 2015 at 06:22:24PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > > Even with the inlined fpclassify (inl2 below), isnormal is slower:
> > >
> > > __isnormal_inl2_t:	1.25	3.67
> > > __isnormal_inl_t:	4.59	2.89
> > > isnormal_t:		1	1
> > >
> > > So using a dedicated builtin for isnormal is important.
> > >
> > That makes the result identical to the one for isnan. That it is slower is a
> > bug in fpclassify, which should first check for normal, then do the unlikely
> > checks.
>
> It's about twice as slow as isnan because the isnormal check isn't done
> efficiently. Fpclassify is slower still as it does 3 comparisons before
> setting FP_NORMAL.
>
Then how do you explain the original data? I couldn't find how you determined
it is twice as slow, when it takes 1.25, which is faster than isinf at 1.28:

__isnormal_inl_t:	8.77	7.16
isnormal_t:		3.16	3.17
__isinf_inl_t:		1.92	2.99
__isinf_t:		8.9	6.17
isinf_t:		1.28	1.28
__isnan_inl_t:		1.91	1.92
__isnan_t:		15.28	15.28
isnan_t:		1	1.01

> > > > > It's certainly correct, but obviously different microarchitectures will show
> > > > > different results. Note the GLIBC private inlines are not particularly good.
> > > > >
> > > > No, the problem is that different benchmarks show different results on the
> > > > same architecture. To speed things up run the following to test all cases of
> > > > the environment. Run the attached tf script to get results on arm.
> > >
> > > I tried, but I don't think this is a good benchmark - you're not measuring
> > > the FP->int move for the branched version, and you're comparing the signed
> > > version of isinf vs the builtin which does isinf_ns.
> > >
> > Wilco, that isinf is signed is again completely irrelevant. gcc is smart.
> > It gets expanded into
> >   if (foo ? (bar ? 1 : -1) : 0)
> > which gcc simplifies to
> >   if (foo)
> > so it doesn't matter that checking the sign would take 100 cycles, as it's
> > deleted code.
>
> It matters for the -DT3 test.
>
Only by increasing code size; it should have zero performance impact. It still
first checks for positive/negative infinity, and the jump skips the bar check.

> > Also, as it could only make the branched version slower, when it beats the
> > builtin then the nonsigned one would also beat the builtin.
> >
> > And do you have assembly to show it doesn't measure the move, or is that just
> > your guess? On x64 it definitely measures the move, and I could add that gcc
> > messes that bit up by moving several times. objdump -d on
> > gcc -DT2 -DBRANCHED -DI1="__attribute((always_inline))" -DI2="__attribute__((always_inline))" ft.c -c
> > clearly shows that the conversion is done several times.
>
> Look at what the inner loop generates for T1 (T2/T3 do the same):
>
> .L27:
> 	ldr	x2, [x1]
> 	cmp	x4, x2, lsl 1
> 	bne	.L29
> 	fadd	d0, d0, d1
> .L29:
> 	add	x1, x1, 8
> 	cmp	x1, x3
> 	bne	.L27
>
> > So will you publish results or not, as they would show your builtins in an
> > unfavorable light?
>
> I don't see the point until your benchmark measures the right thing.
> Note my benchmark carefully avoids this gotcha.
>
Looks like the arm gcc optimizes this better than x64. I added an addition to
force the value into a floating-point register. Then I needed to add volatile
to force gcc to emit better code. Best would be to use assembly, but that
makes the benchmark platform specific. I don't know if dropping volatile
would help, and I don't want to add arm assembly without testing it, so add
arm assembly if gcc also produces suboptimal code for you.
Those changes give different numbers on x64, which follow:

don't inline
  conditional add
    branched: real 0m1.080s  user 0m1.079s  sys 0m0.000s
    builtin:  real 0m1.079s  user 0m1.075s  sys 0m0.003s
  branch
    branched: real 0m1.031s  user 0m1.031s  sys 0m0.000s
    builtin:  real 0m0.848s  user 0m0.848s  sys 0m0.000s
  sum
    branched: real 0m1.155s  user 0m1.154s  sys 0m0.000s
    builtin:  real 0m1.003s  user 0m1.002s  sys 0m0.000s
inline outer call
  conditional add
    branched: real 0m0.618s  user 0m0.617s  sys 0m0.000s
    builtin:  real 0m0.771s  user 0m0.771s  sys 0m0.000s
  branch
    branched: real 0m0.618s  user 0m0.617s  sys 0m0.000s
    builtin:  real 0m0.618s  user 0m0.617s  sys 0m0.000s
  sum
    branched: real 0m0.693s  user 0m0.689s  sys 0m0.003s
    builtin:  real 0m0.694s  user 0m0.693s  sys 0m0.000s
inline inner call
  conditional add
    branched: real 0m0.800s  user 0m0.800s  sys 0m0.000s
    builtin:  real 0m0.694s  user 0m0.694s  sys 0m0.000s
  branch
    branched: real 0m0.618s  user 0m0.614s  sys 0m0.003s
    builtin:  real 0m0.618s  user 0m0.617s  sys 0m0.000s
  sum
    branched: real 0m0.695s  user 0m0.694s  sys 0m0.000s
    builtin:  real 0m0.747s  user 0m0.747s  sys 0m0.000s
tight loop
  conditional add
    branched: real 0m0.227s  user 0m0.226s  sys 0m0.000s
    builtin:  real 0m0.255s  user 0m0.254s  sys 0m0.000s
  branch
    branched: real 0m0.225s  user 0m0.224s  sys 0m0.000s
    builtin:  real 0m0.234s  user 0m0.233s  sys 0m0.000s
  sum
    branched: real 0m0.270s  user 0m0.269s  sys 0m0.000s
    builtin:  real 0m0.391s  user 0m0.390s  sys 0m0.000s

#include <stdlib.h>
#include <stdio.h>
//#include <math.h>

#undef isinf2

#ifdef BRANCHED
/* Branched version: integer compare on the bit pattern.  */
int I2
isinf2 (double dx)
{
  union u { double d; long l; };
  volatile union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 : -1) : 0;
}
#else
/* Builtin version.  */
int I2
isinf2 (double x)
{
  return __builtin_isinf (x);
}
#endif

/* conditional add */
void I1
t1 (double x, double *d)
{
  if (isinf2 (x))
    *d += 42;
}

/* branch */
void I1
t2 (double x, double *d)
{
  if (isinf2 (x))
    abort ();
}

/* sum */
void I1
t3 (double x, double *d)
{
  *d += isinf2 (x);
}

int
main ()
{
  double ret = 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i = 0; i < 1000; i++)
    d[i] = 4.2; //(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);

  for (j = 0; j < 300000; j++)
#ifdef T1
    for (i = 0; i < 1000; i++)
      t1 (d[i] + 1.3, &ret);
#endif
#ifdef T2
    for (i = 0; i < 1000; i++)
      t2 (d[i] + 1.3, &ret);
#endif
#ifdef T3
    for (i = 0; i < 1000; i++)
      t3 (d[i] + 1.3, &ret);
#endif
  return ret;
}

test() {
  gcc -O3 -DBRANCHED -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
  echo branched
  time ./a.out
  gcc -O3 -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
  echo builtin
  time ./a.out
}

test2() {
  echo conditional add
  test $1 $2 1
  echo branch
  test $1 $2 2
  echo sum
  test $1 $2 3
}

echo "don't inline"
test2 '__attribute__((noinline))' '__attribute__((noinline))'
echo inline outer call
test2 '__attribute__((noinline))' '__attribute__((always_inline))'
echo inline inner call
test2 '__attribute__((always_inline))' '__attribute__((noinline))'
echo tight loop
test2 '__attribute__((always_inline))' '__attribute__((always_inline))'
On Tue, Jun 16, 2015 at 03:54:35PM -0400, Carlos O'Donell wrote:
> On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
> >> So to make this proposal move forward, how exactly do you propose to
> >> create a benchtest for such a scenario? I get this is tricky and a lot of
> >> variables may apply, but I do agree with Joseph that we shouldn't quite
> >> aim for optimal performance, imho using compiler builtins with reasonable
> >> performance is a gain in code maintainability.
> >>
> > As I said before about these, they are hard to measure and I could
> > argue also against my benchmark that it is inaccurate as it doesn't measure
> > the effect of the cpu pipeline when the function does other computation.
> > The answer is: don't do microbenchmarks.
>
> That's not an answer, an answer is "Here's a patch to extend the libm testing
> to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."
>
No, it is an answer, as it isn't my responsibility to provide a benchmark to
convince you that the change is desirable but the submitter's. As I said
before, he should for example add catan to benchtests, measure the difference
and report that. If necessary, increase the iteration count to catch the
difference. That is necessary anyway if we want to measure microoptimizations
that improve performance by several cycles.

> I agree that microbenchmarks can be misleading if interpreted by automated
> systems, but we aren't talking about that yet, we are talking about experts
> using these tools to discuss patches in an objective fashion.
>
No, they will give you exactly the same argument I gave to explain why what
you want is impossible.

Carlos, you talk a lot about deciding objectively, but when I ask you to, it
is never done. So I will ask you again to decide based on my previous
benchmark. There sometimes the builtin is 20% faster and sometimes the
current inline is 20% faster. How do you imagine that experts would decide
solely on that, instead of telling you that it is inconclusive and you need
to do real-world measurements, or that the benchmark is flawed because of X?

don't inline
  conditional add
    branched: real 0m1.313s  user 0m1.312s  sys 0m0.000s
    builtin:  real 0m1.309s  user 0m1.308s  sys 0m0.000s
  branch
    branched: real 0m1.310s  user 0m1.308s  sys 0m0.000s
    builtin:  real 0m1.337s  user 0m1.312s  sys 0m0.004s
  sum
    branched: real 0m1.209s  user 0m1.204s  sys 0m0.000s
    builtin:  real 0m1.216s  user 0m1.212s  sys 0m0.000s
inline outer call
  conditional add
    branched: real 0m0.705s  user 0m0.704s  sys 0m0.000s
    builtin:  real 0m0.916s  user 0m0.916s  sys 0m0.000s
  branch
    branched: real 0m0.806s  user 0m0.804s  sys 0m0.000s
    builtin:  real 0m0.721s  user 0m0.716s  sys 0m0.000s
  sum
    branched: real 0m1.029s  user 0m1.028s  sys 0m0.000s
    builtin:  real 0m0.911s  user 0m0.908s  sys 0m0.000s
inline inner call
  conditional add
    branched: real 0m1.038s  user 0m1.032s  sys 0m0.000s
    builtin:  real 0m1.024s  user 0m1.016s  sys 0m0.000s
  branch
    branched: real 0m0.614s  user 0m0.608s  sys 0m0.000s
    builtin:  real 0m0.606s  user 0m0.608s  sys 0m0.000s
  sum
    branched: real 0m0.662s  user 0m0.660s  sys 0m0.000s
    builtin:  real 0m0.629s  user 0m0.628s  sys 0m0.000s
tight loop
  conditional add
    branched: real 0m0.208s  user 0m0.208s  sys 0m0.000s
    builtin:  real 0m0.326s  user 0m0.324s  sys 0m0.000s
  branch
    branched: real 0m0.204s  user 0m0.200s  sys 0m0.000s
    builtin:  real 0m0.325s  user 0m0.324s  sys 0m0.000s
  sum
    branched: real 0m0.328s  user 0m0.332s  sys 0m0.000s
    builtin:  real 0m0.486s  user 0m0.484s  sys 0m0.000s
On 07/03/2015 04:40 AM, Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 03:54:35PM -0400, Carlos O'Donell wrote:
>> On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
>>>> So to make this proposal move forward, how exactly do you propose to
>>>> create a benchtest for such a scenario? I get this is tricky and a lot of
>>>> variables may apply, but I do agree with Joseph that we shouldn't quite
>>>> aim for optimal performance, imho using compiler builtins with reasonable
>>>> performance is a gain in code maintainability.
>>>>
>>> As I said before about these, they are hard to measure and I could
>>> argue also against my benchmark that it is inaccurate as it doesn't measure
>>> the effect of the cpu pipeline when the function does other computation.
>>> The answer is: don't do microbenchmarks.
>>
>> That's not an answer, an answer is "Here's a patch to extend the libm testing
>> to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."
>>
> No, it is an answer, as it isn't my responsibility to provide a benchmark to
> convince you that the change is desirable but the submitter's. As I said
> before, he should for example add catan to benchtests, measure the difference
> and report that. If necessary, increase the iteration count to catch the
> difference. That is necessary anyway if we want to measure microoptimizations
> that improve performance by several cycles.

I never said it was your responsibility. It is the responsibility of the
person submitting the patch to provide an objective description of how they
verified the performance gains. It has become standard practice to recommend
the author of the patch contribute a microbenchmark, but it need not be a
microbenchmark. The author does need to provide sufficient performance
justification, including the methods used to measure the performance gains,
to ensure the results can be reproduced.

In this case, if you believe catan can be used to test the performance of
this patch, please suggest this to Wilco. However, I will not accept
performance patches without justification for the performance gains.

> Carlos, you talk a lot about deciding objectively, but when I ask you to, it
> is never done. So I will ask you again to decide based on my previous
> benchmark. There sometimes the builtin is 20% faster and sometimes the
> current inline is 20% faster. How do you imagine that experts would decide
> solely on that, instead of telling you that it is inconclusive and you need
> to do real-world measurements, or that the benchmark is flawed because of X?

My apologies if I failed to do something you requested.

I'm not interested in abstract examples, I'm interested in the patch being
submitted by ARM. I will continue this discussion on the downstream thread
that includes the microbenchmark written by Wilco.

In general I expect there are certainly classes of performance problems that
have inconclusive results. In such cases I will reject the patch until you
tell me how you measured the performance gains and what you expect the
performance gains to be on average.

Cheers,
Carlos.
On Fri, Jul 03, 2015 at 10:28:56AM -0400, Carlos O'Donell wrote:
> On 07/03/2015 04:40 AM, Ondřej Bílka wrote:
> > On Tue, Jun 16, 2015 at 03:54:35PM -0400, Carlos O'Donell wrote:
> >> On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
> >>>> So to make this proposal move forward, how exactly do you propose to
> >>>> create a benchtest for such a scenario? I get this is tricky and a lot of
> >>>> variables may apply, but I do agree with Joseph that we shouldn't quite
> >>>> aim for optimal performance, imho using compiler builtins with reasonable
> >>>> performance is a gain in code maintainability.
> >>>>
> >>> As I said before about these, they are hard to measure and I could
> >>> argue also against my benchmark that it is inaccurate as it doesn't measure
> >>> the effect of the cpu pipeline when the function does other computation.
> >>> The answer is: don't do microbenchmarks.
> >>
> >> That's not an answer, an answer is "Here's a patch to extend the libm testing
> >> to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."
> >>
> > No, it is an answer, as it isn't my responsibility to provide a benchmark to
> > convince you that the change is desirable but the submitter's. As I said
> > before, he should for example add catan to benchtests, measure the difference
> > and report that. If necessary, increase the iteration count to catch the
> > difference. That is necessary anyway if we want to measure microoptimizations
> > that improve performance by several cycles.
>
> I never said it was your responsibility. It is the responsibility of the
> person submitting the patch to provide an objective description of how they
> verified the performance gains. It has become standard practice to recommend
> the author of the patch contribute a microbenchmark, but it need not be a
> microbenchmark. The author does need to provide sufficient performance
> justification, including the methods used to measure the performance gains,
> to ensure the results can be reproduced.
>
Correct. There are plenty of things that could go wrong here, so I need to
point that out. The only benchmark whose result I would be certain of would
be to take one of the numerical applications mentioned before where is* is a
bottleneck and measure its performance before and after.

> In this case, if you believe catan can be used to test the performance of
> this patch, please suggest this to Wilco. However, I will not accept
> performance patches without justification for the performance gains.
>
I already wrote about that, to get a more precise answer.

> > Carlos, you talk a lot about deciding objectively, but when I ask you to, it
> > is never done. So I will ask you again to decide based on my previous
> > benchmark. There sometimes the builtin is 20% faster and sometimes the
> > current inline is 20% faster. How do you imagine that experts would decide
> > solely on that, instead of telling you that it is inconclusive and you need
> > to do real-world measurements, or that the benchmark is flawed because of X?
>
> My apologies if I failed to do something you requested.
>
> I'm not interested in abstract examples, I'm interested in the patch being
> submitted by ARM. I will continue this discussion on the downstream thread
> that includes the microbenchmark written by Wilco.
>
> In general I expect there are certainly classes of performance problems that
> have inconclusive results. In such cases I will reject the patch until you
> tell me how you measured the performance gains and what you expect the
> performance gains to be on average.
>
The problem is that most functions have inconclusive results, as we use mined
cases where possible and must rely on assumptions about the input distribution
to get a speedup. The simplest example is that in the string functions we
assume that a 64-byte load crosses a page boundary only rarely, and the list
of these assumptions goes on and on. Then when a microbenchmark violates one
of these assumptions because it is too simplistic, what matters is whether the
change gives a real speedup in programs, not what the microbenchmark shows.
On 07/09/2015 11:02 AM, Ondřej Bílka wrote:
>> I never said it was your responsibility. It is the responsibility of the
>> person submitting the patch to provide an objective description of how they
>> verified the performance gains. It has become standard practice to recommend
>> the author of the patch contribute a microbenchmark, but it need not be a
>> microbenchmark. The author does need to provide sufficient performance
>> justification, including the methods used to measure the performance gains,
>> to ensure the results can be reproduced.
>>
> Correct. There are plenty of things that could go wrong here, so I need to
> point that out.

Agreed. I would like to see us create a "Measuring Performance" wiki page
with what the community expects. I think you would have a lot to contribute
to such a document.

> The only benchmark whose result I would be certain of would be to take one
> of the numerical applications mentioned before where is* is a bottleneck
> and measure its performance before and after.

That is a good recommendation. When I worked in HPC the common phrase was
"the only benchmark that matters is your application." :-)

In glibc this is harder because we are looking for average-case performance
across all applications.

In some ways the optimization space will get both harder and easier as
Siddhesh works on tunables. Applications will be able to adjust runtime
behaviour for performance. However, untuned general performance will still
be just as difficult to optimize for. And then we get to talk about
auto-tuning algorithms, and tunings for given workloads.

>> In this case, if you believe catan can be used to test the performance of
>> this patch, please suggest this to Wilco. However, I will not accept
>> performance patches without justification for the performance gains.
>>
> I already wrote about that, to get a more precise answer.

Thanks.

>>> Carlos, you talk a lot about deciding objectively, but when I ask you to, it
>>> is never done. So I will ask you again to decide based on my previous
>>> benchmark. There sometimes the builtin is 20% faster and sometimes the
>>> current inline is 20% faster. How do you imagine that experts would decide
>>> solely on that, instead of telling you that it is inconclusive and you need
>>> to do real-world measurements, or that the benchmark is flawed because of X?
>>
>> My apologies if I failed to do something you requested.
>>
>> I'm not interested in abstract examples, I'm interested in the patch being
>> submitted by ARM. I will continue this discussion on the downstream thread
>> that includes the microbenchmark written by Wilco.
>>
>> In general I expect there are certainly classes of performance problems that
>> have inconclusive results. In such cases I will reject the patch until you
>> tell me how you measured the performance gains and what you expect the
>> performance gains to be on average.
>>
> The problem is that most functions have inconclusive results, as we use mined
> cases where possible and must rely on assumptions about the input distribution
> to get a speedup. The simplest example is that in the string functions we
> assume that a 64-byte load crosses a page boundary only rarely, and the list
> of these assumptions goes on and on. Then when a microbenchmark violates one
> of these assumptions because it is too simplistic, what matters is whether the
> change gives a real speedup in programs, not what the microbenchmark shows.

I don't follow. Could you expand on your thought there?

Cheers,
Carlos.
diff --git a/math/Makefile b/math/Makefile
index 9a3cf32..f78d75b 100644
--- a/math/Makefile
+++ b/math/Makefile
@@ -155,6 +155,7 @@ CFLAGS-test-tgmath.c = -fno-builtin
 CFLAGS-test-tgmath2.c = -fno-builtin
 CFLAGS-test-tgmath-ret.c = -fno-builtin
 CFLAGS-test-powl.c = -fno-builtin
+CFLAGS-test-snan.c = -fsignaling-nans
 CPPFLAGS-test-ifloat.c = -U__LIBC_INTERNAL_MATH_INLINES -D__FAST_MATH__ \
 			 -DTEST_FAST_MATH -fno-builtin
 CPPFLAGS-test-idouble.c = -U__LIBC_INTERNAL_MATH_INLINES -D__FAST_MATH__ \
diff --git a/math/math.h b/math/math.h
index 22f0989..1721118 100644
--- a/math/math.h
+++ b/math/math.h
@@ -215,8 +215,15 @@ enum
   FP_NORMAL
 };
 
+/* GCC bug 66462 means we cannot use the math builtins with -fsignaling-nan,
+   so disable builtins if this is enabled.  When fixed in a newer GCC,
+   the __SUPPORT_SNAN__ check may be skipped for those versions.  */
+
 /* Return number of classification appropriate for X.  */
-# ifdef __NO_LONG_DOUBLE_MATH
+# if __GNUC_PREREQ (4,4) && !defined __SUPPORT_SNAN__
+#  define fpclassify(x) __builtin_fpclassify (FP_NAN, FP_INFINITE,	      \
+     FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x)
+# elif defined __NO_LONG_DOUBLE_MATH
 #  define fpclassify(x) \
      (sizeof (x) == sizeof (float) ? __fpclassifyf (x) : __fpclassify (x))
 # else
@@ -229,32 +236,26 @@ enum
 
 /* Return nonzero value if sign of X is negative.  */
 # if __GNUC_PREREQ (4,0)
-#  ifdef __NO_LONG_DOUBLE_MATH
-#   define signbit(x) \
-     (sizeof (x) == sizeof (float) \
-      ? __builtin_signbitf (x) : __builtin_signbit (x))
-#  else
-#   define signbit(x) \
-     (sizeof (x) == sizeof (float) \
-      ? __builtin_signbitf (x) \
-      : sizeof (x) == sizeof (double) \
+#  define signbit(x) \
+     (sizeof (x) == sizeof (float) \
+      ? __builtin_signbitf (x) \
+      : sizeof (x) == sizeof (double) \
       ? __builtin_signbit (x) : __builtin_signbitl (x))
-#  endif
-# else
-#  ifdef __NO_LONG_DOUBLE_MATH
-#   define signbit(x) \
+# elif defined __NO_LONG_DOUBLE_MATH
+#  define signbit(x) \
      (sizeof (x) == sizeof (float) ? __signbitf (x) : __signbit (x))
-#  else
-#   define signbit(x) \
+# else
+#  define signbit(x) \
      (sizeof (x) == sizeof (float) \
       ? __signbitf (x) \
       : sizeof (x) == sizeof (double) \
       ? __signbit (x) : __signbitl (x))
-#  endif
 # endif
 
 /* Return nonzero value if X is not +-Inf or NaN.  */
-# ifdef __NO_LONG_DOUBLE_MATH
+# if __GNUC_PREREQ (4,4) && !defined __SUPPORT_SNAN__
+#  define isfinite(x) __builtin_isfinite (x)
+# elif defined __NO_LONG_DOUBLE_MATH
 #  define isfinite(x) \
      (sizeof (x) == sizeof (float) ? __finitef (x) : __finite (x))
 # else
@@ -266,11 +267,17 @@ enum
 # endif
 
 /* Return nonzero value if X is neither zero, subnormal, Inf, nor NaN.  */
-# define isnormal(x) (fpclassify (x) == FP_NORMAL)
+# if __GNUC_PREREQ (4,4) && !defined __SUPPORT_SNAN__
+#  define isnormal(x) __builtin_isnormal (x)
+# else
+#  define isnormal(x) (fpclassify (x) == FP_NORMAL)
+# endif
 
 /* Return nonzero value if X is a NaN.  We could use `fpclassify' but we
    already have this functions `__isnan' and it is faster.  */
-# ifdef __NO_LONG_DOUBLE_MATH
+# if __GNUC_PREREQ (4,4) && !defined __SUPPORT_SNAN__
+#  define isnan(x) __builtin_isnan (x)
+# elif defined __NO_LONG_DOUBLE_MATH
 #  define isnan(x) \
      (sizeof (x) == sizeof (float) ? __isnanf (x) : __isnan (x))
 # else
@@ -282,7 +289,9 @@ enum
 # endif
 
 /* Return nonzero value if X is positive or negative infinity.  */
-# ifdef __NO_LONG_DOUBLE_MATH
+# if __GNUC_PREREQ (4,4) && !defined __SUPPORT_SNAN__
+#  define isinf(x) __builtin_isinf_sign (x)
+# elif defined __NO_LONG_DOUBLE_MATH
 #  define isinf(x) \
      (sizeof (x) == sizeof (float) ? __isinff (x) : __isinf (x))
 # else
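
For illustration, with the header change above in place, a plain user call
such as the one below (clamp_inf is a hypothetical example function) picks up
__builtin_isinf_sign on GCC 4.4+ when -fsignaling-nans is not in use, so no
PLT call to __isinf is emitted:

#include <float.h>
#include <math.h>

double
clamp_inf (double x)
{
  if (isinf (x))                /* expands to __builtin_isinf_sign (x) */
    return x > 0.0 ? DBL_MAX : -DBL_MAX;
  return x;
}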