Message ID | DB6PR0801MB2053788EB64591EC936E598683DE0@DB6PR0801MB2053.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | [1/6] Remove slow paths from sin/cos | expand |
On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote: > This patch removes 2nd of the 3 range reduction cases and defer to the final one. > Input values above 2^27 are extremely rare, so this case doesn't need to as be > optimized as smaller inputs. Please elaborate on two more points: - The basis for your claim that values above 2^27 are extremely rare, i.e. papers that you may have referred that conclude this and/or applications you may have profiled - The quantum of performance loss due to dropping the 2nd case. Thanks, Siddhesh
On Fri, Mar 09, 2018 at 09:47:09PM +0530, Siddhesh Poyarekar wrote: > On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote: > > This patch removes 2nd of the 3 range reduction cases and defer to the final one. > > Input values above 2^27 are extremely rare, so this case doesn't need to as be > > optimized as smaller inputs. > > Please elaborate on two more points: > > - The basis for your claim that values above 2^27 are extremely rare, > i.e. papers that you may have referred that conclude this and/or > applications you may have profiled > Main reason is that for these inputs accuracy doesn't make a sense. There is already millions bigger error caused by limited precision when rounding input to float.
On Fri, Mar 9, 2018 at 1:19 PM, Ondřej Bílka <neleai@seznam.cz> wrote: > On Fri, Mar 09, 2018 at 09:47:09PM +0530, Siddhesh Poyarekar wrote: >> On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote: >> > This patch removes 2nd of the 3 range reduction cases and defer to the final one. >> > Input values above 2^27 are extremely rare, so this case doesn't need to as be >> > optimized as smaller inputs. >> >> Please elaborate on two more points: >> >> - The basis for your claim that values above 2^27 are extremely rare, >> i.e. papers that you may have referred that conclude this and/or >> applications you may have profiled >> > Main reason is that for these inputs accuracy doesn't make a sense. > There is already millions bigger error caused by limited precision when > rounding input to float. More concretely, for IEEE single, the gap between representable values is bigger than 2π for values whose exponent is 2^26 or above. Since the sine function is periodic over 2π, that means the result of sinf() is effectively meaningless for any input at least that big - _any value_ within the input period could have been rounded to the representable, so the "true" answer could be anything. (I am tempted to suggest that we return NaN for inputs this large, without even bothering to do argument reduction.) For double, this happens at 2^55, and for x86-64 long double it happens at 2^66. I _think_ x86-64 long double is 80-bit IEEE extended, not quad, but I could be wrong. zw
Ondřej Bílka wrote: > On Fri, Mar 09, 2018 at 09:47:09PM +0530, Siddhesh Poyarekar wrote: >> On Friday 09 March 2018 09:16 PM, Wilco Dijkstra wrote: >> > This patch removes 2nd of the 3 range reduction cases and defer to the final one. >> > Input values above 2^27 are extremely rare, so this case doesn't need to as be >> > optimized as smaller inputs. >> >> Please elaborate on two more points: >> >> - The basis for your claim that values above 2^27 are extremely rare, >> i.e. papers that you may have referred that conclude this and/or >> applications you may have profiled >> > Main reason is that for these inputs accuracy doesn't make a sense. > There is already millions bigger error caused by limited precision when > rounding input to float. Exactly, it's rare for sin/cos input to be > 10, largest I've ever seen was around 1000. There is a case to be made for making wildly out or range inputs an error or just return 0.0 rather than using insanely accurate range reduction (the result is going to be equally wrong either way), but that's a different discussion. Fast range reduction of up to 2^27 is a huge overkill, we could get a significant speedup by reducing this range. The 2nd level range reduction (2^27 till 2^48) is so pointless that there is absolutely no reason keep it. There is about a 2.3x slowdown in this range due to __branred being braindead slow. Basically I can only see a reason for 3 levels of range reduction if the first range reduction only handles a tiny range (say up to 10), the final range reduction is extremely slow and we can find proof of an application relying on fast range reduction up to say 1000 or 1000000. I think a 3-4x speedup of __branred should be feasible using a much simpler algorithm and integer arithmetic. This will not only cover a much wider range but then we will also be sure it works correctly! Wilco
On Saturday 10 March 2018 12:36 AM, Wilco Dijkstra wrote: > Exactly, it's rare for sin/cos input to be > 10, largest I've ever seen was around 1000. > There is a case to be made for making wildly out or range inputs an error or just return 0.0 > rather than using insanely accurate range reduction (the result is going to be equally wrong > either way), but that's a different discussion. To be clear, I do not object to the patchset, it is the right way forward. I would like to see clearer documentation as to why we decided to drop this path in this patch and what the impact of that is. In that sense, it would be nice to have a better rationale in the commit log than "I've never seen inputs that big". > Fast range reduction of up to 2^27 is a huge overkill, we could get a significant speedup by > reducing this range. > > The 2nd level range reduction (2^27 till 2^48) is so pointless that there is absolutely no > reason keep it. There is about a 2.3x slowdown in this range due to __branred being > braindead slow. Thanks, all this should be part of the commit log. However, from what I understand, this particular patch will result in a regression (since IIRC reduce_and_compute is slower) but the following patch will fix that and improve things. Siddhesh
On Fri, 2018-03-09 at 13:52 -0500, Zack Weinberg wrote: > > More concretely, for IEEE single, the gap between representable values > is bigger than 2π for values whose exponent is 2^26 or above. Since > the sine function is periodic over 2π, that means the result of sinf() > is effectively meaningless for any input at least that big - _any > value_ within the input period could have been rounded to the > representable, so the "true" answer could be anything. (I am tempted > to suggest that we return NaN for inputs this large, without even > bothering to do argument reduction.) > > For double, this happens at 2^55, and for x86-64 long double it > happens at 2^66. I _think_ x86-64 long double is 80-bit IEEE > extended, not quad, but I could be wrong. > > zw And yet we have tests for large numbers in auto-libm-test-out-sin and auto-libm-test-out-cos. I was testing a different vector sin/cos implementation on aarch64 and I got some big diffs on very large output. Maybe we shouldn't test for these values if they are not of any value. testing double (vector length 2) Failure: Test: sin_vlen2 (-0x2p+64) Result: is: -0.0000000000000000e+00 -0x0.0000000000000p+0 should be: 4.7183876212354675e-02 0x1.8287c2a760e04p-5 difference: 4.7183876212354675e-02 0x1.8287c2a760e04p-5 ulp : 6799913194491396.0000 max.ulp : 1.0000 Failure: Test: sin_vlen2 (-0x3.3de320f6be87ep+1020) Result: is: -0.0000000000000000e+00 -0x0.0000000000000p+0 should be: -9.9219562491895441e-01 -0x1.fc0110a085bafp-1 difference: 9.9219562491895441e-01 0x1.fc0110a085bafp-1 ulp : 8936903693327279.0000 max.ulp : 1.0000
On Fri, 9 Mar 2018, Steve Ellcey wrote: > And yet we have tests for large numbers in auto-libm-test-out-sin > and auto-libm-test-out-cos. I was testing a different vector sin/cos Yes. A floating-point number represents a particular real number, not an interval, so all the usual accuracy goals (of results within a few ulps of the correct answer) apply for large inputs (but performance is not a concern for those inputs). Only for IBM long double is this relaxed, to treat values not representable in 106 mantissa bits as if they do represent intervals.
Joseph Myers wrote: > Yes. A floating-point number represents a particular real number, not an > interval, so all the usual accuracy goals (of results within a few ulps of > the correct answer) apply for large inputs (but performance is not a > concern for those inputs). Only for IBM long double is this relaxed, to > treat values not representable in 106 mantissa bits as if they do > represent intervals. Though these patches keep the ULP accuracy across the full range as is, we could agree on higher ULP errors for large/huge range reduction cases in the future. The main complexity is for certain rare inputs which happen to be extremely close to an integer multiple of PI/2, and those few cases mean you need significant extra work to guarantee 0.5 ULP error bound on range reduction. Wilco
On Mon, Mar 12, 2018 at 11:36 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > Joseph Myers wrote: > >> Yes. A floating-point number represents a particular real number, not an >> interval, so all the usual accuracy goals (of results within a few ulps of >> the correct answer) apply for large inputs (but performance is not a >> concern for those inputs). Only for IBM long double is this relaxed, to >> treat values not representable in 106 mantissa bits as if they do >> represent intervals. > > Though these patches keep the ULP accuracy across the full range as is, > we could agree on higher ULP errors for large/huge range reduction cases > in the future. The main complexity is for certain rare inputs which happen to > be extremely close to an integer multiple of PI/2, and those few cases mean > you need significant extra work to guarantee 0.5 ULP error bound on range > reduction. I think 0.5 ULP error bound for values close to _small_ integer multiples of pi/2 is worth preserving. I'm not sure what the right cutoff is, but if we need to do that "significant extra work" for, like, values close to 10*pi and the most natural way to implement that also gives us the same error bound for values close to 1e6*pi, then I don't see why we shouldn't keep it for both. zw
On Mon, 12 Mar 2018, Wilco Dijkstra wrote: > Though these patches keep the ULP accuracy across the full range as is, > we could agree on higher ULP errors for large/huge range reduction cases > in the future. The main complexity is for certain rare inputs which happen to > be extremely close to an integer multiple of PI/2, and those few cases mean > you need significant extra work to guarantee 0.5 ULP error bound on range > reduction. I don't think the work for having an error bound not much more than 0.5ulp on the final result of sin/cos for large arguments is significantly different from the work for having an error bound of say 3ulp (the testsuite has a global maximum of 9ulp (16ulp for IBM long double) beyond which it will not accept errors even if those large errors are listed in libm-test-ulps files - that bound is simply based on the errors empirically observed at present, for functions other than Bessel functions and cpow which are known to have cases with much larger errors).
Siddhesh Poyarekar wrote: > On Saturday 10 March 2018 12:36 AM, Wilco Dijkstra wrote: >> Exactly, it's rare for sin/cos input to be > 10, largest I've ever seen was around 1000. >> There is a case to be made for making wildly out or range inputs an error or just return 0.0 >> rather than using insanely accurate range reduction (the result is going to be equally wrong >> either way), but that's a different discussion. > > To be clear, I do not object to the patchset, it is the right way > forward. I would like to see clearer documentation as to why we decided > to drop this path in this patch and what the impact of that is. In that > sense, it would be nice to have a better rationale in the commit log > than "I've never seen inputs that big". There are multiple reasons, correctness (the implementation wasn't accurate enough), premature optimization, complexity, etc. If we are to spend time on developing a better range reduction, the time is best spent on (a) common cases, ie. small values, and (b) on a much faster general range reducer (which could be designed to be faster on smaller inputs rather than be constant-time time __branred). A lot of the speedup is achieved by just making the code simpler. If we want to support 3-level range reduction, there must be a really good reason for it (I've not seen 3-level reduction used anywhere else). >> Fast range reduction of up to 2^27 is a huge overkill, we could get a significant speedup by >> reducing this range. >> >> The 2nd level range reduction (2^27 till 2^48) is so pointless that there is absolutely no >> reason keep it. There is about a 2.3x slowdown in this range due to __branred being >> braindead slow. > > Thanks, all this should be part of the commit log. However, from what I > understand, this particular patch will result in a regression (since > IIRC reduce_and_compute is slower) but the following patch will fix that > and improve things. I'll improve the commit log in the next version (assuming there are comments on the actual patch!). Wilco
Joseph Myers wrote: > On Mon, 12 Mar 2018, Wilco Dijkstra wrote: > >> Though these patches keep the ULP accuracy across the full range as is, >> we could agree on higher ULP errors for large/huge range reduction cases >> in the future. The main complexity is for certain rare inputs which happen to >> be extremely close to an integer multiple of PI/2, and those few cases mean >> you need significant extra work to guarantee 0.5 ULP error bound on range >> reduction. > > I don't think the work for having an error bound not much more than 0.5ulp > on the final result of sin/cos for large arguments is significantly > different from the work for having an error bound of say 3ulp (the > testsuite has a global maximum of 9ulp (16ulp for IBM long double) beyond > which it will not accept errors even if those large errors are listed in > libm-test-ulps files - that bound is simply based on the errors > empirically observed at present, for functions other than Bessel functions > and cpow which are known to have cases with much larger errors). Having to do another reduction step can increase latency significantly. To give an example, I had to increase accuracy of the first-level range reduction because a handful huge multiples of PI (6-7 digits) had 2-3ULP reduction error. The additional accuracy means range reduction became 50% slower while being way too accurate for almost all inputs. Obviously it would be possible redesign it from scratch using a much smaller input range and FMA when available etc. Wilco
On Monday 12 March 2018 11:38 PM, Wilco Dijkstra wrote: > I'll improve the commit log in the next version (assuming there are comments on the actual > patch!). Overall the patchset looks good to me but I don't think I'll have time to take a closer look any time this week or the next. In any case I will defer to Joseph for final approval of the patchset given his more extensive experience in the area. Siddhesh
diff --git a/sysdeps/ieee754/dbl-64/s_sin.c b/sysdeps/ieee754/dbl-64/s_sin.c index 9673a461ac592fc2bf3babc755dae336312e4c56..1f98e29278183d1fccd7c2b3fd467d6b16c245ed 100644 --- a/sysdeps/ieee754/dbl-64/s_sin.c +++ b/sysdeps/ieee754/dbl-64/s_sin.c @@ -362,80 +362,6 @@ do_sincos_1 (double a, double da, double x, int4 n, bool shift_quadrant) return retval; } -static inline int4 -__always_inline -reduce_sincos_2 (double x, double *a, double *da) -{ - mynumber v; - - double t = (x * hpinv + toint); - double xn = t - toint; - v.x = t; - double xn1 = (xn + 8.0e22) - 8.0e22; - double xn2 = xn - xn1; - double y = ((((x - xn1 * mp1) - xn1 * mp2) - xn2 * mp1) - xn2 * mp2); - int4 n = v.i[LOW_HALF] & 3; - double db = xn1 * pp3; - t = y - db; - db = (y - t) - db; - db = (db - xn2 * pp3) - xn * pp4; - double b = t + db; - db = (t - b) + db; - - *a = b; - *da = db; - - return n; -} - -/* Compute sin (A + DA). cos can be computed by passing SHIFT_QUADRANT as - true, which results in shifting the quadrant N clockwise. */ -static double -__always_inline -do_sincos_2 (double a, double da, double x, int4 n, bool shift_quadrant) -{ - double res, retval, cor, xx; - - double eps = 1.0e-24; - - int4 k = (n + shift_quadrant) & 3; - - switch (k) - { - case 2: - a = -a; - da = -da; - /* Fall through. */ - case 0: - xx = a * a; - if (xx < 0.01588) - { - /* Taylor series. */ - res = TAYLOR_SIN (xx, a, da, cor); - cor = 1.02 * cor + __copysign (eps, cor); - retval = (res == res + cor) ? res : bsloww (a, da, x, n); - } - else - { - res = do_sin (a, da, &cor); - cor = 1.035 * cor + __copysign (eps, cor); - retval = ((res == res + cor) ? __copysign (res, a) - : bsloww1 (a, da, x, n)); - } - break; - - case 1: - case 3: - res = do_cos (a, da, &cor); - cor = 1.025 * cor + __copysign (eps, cor); - retval = ((res == res + cor) ? ((n & 2) ? -res : res) - : bsloww2 (a, da, x, n)); - break; - } - - return retval; -} - /*******************************************************************/ /* An ultimate sin routine. Given an IEEE double machine number x */ /* it computes the correctly rounded (to nearest) value of sin(x) */ @@ -498,16 +424,7 @@ __sin (double x) retval = do_sincos_1 (a, da, x, n, false); } /* else if (k < 0x419921FB ) */ -/*---------------------105414350 <|x|< 281474976710656 --------------------*/ - else if (k < 0x42F00000) - { - double a, da; - - int4 n = reduce_sincos_2 (x, &a, &da); - retval = do_sincos_2 (a, da, x, n, false); - } /* else if (k < 0x42F00000 ) */ - -/* -----------------281474976710656 <|x| <2^1024----------------------------*/ +/* --------------------105414350 <|x| <2^1024------------------------------*/ else if (k < 0x7ff00000) retval = reduce_and_compute (x, false); @@ -584,15 +501,7 @@ __cos (double x) retval = do_sincos_1 (a, da, x, n, true); } /* else if (k < 0x419921FB ) */ - else if (k < 0x42F00000) - { - double a, da; - - int4 n = reduce_sincos_2 (x, &a, &da); - retval = do_sincos_2 (a, da, x, n, true); - } /* else if (k < 0x42F00000 ) */ - - /* 281474976710656 <|x| <2^1024 */ + /* 105414350 <|x| <2^1024 */ else if (k < 0x7ff00000) retval = reduce_and_compute (x, true); diff --git a/sysdeps/ieee754/dbl-64/s_sincos.c b/sysdeps/ieee754/dbl-64/s_sincos.c index e1977ea7e93c32cca5369677f23e68f8f797a9f4..a9af8ce526bfe78c06cfafa65de0815ec69585c5 100644 --- a/sysdeps/ieee754/dbl-64/s_sincos.c +++ b/sysdeps/ieee754/dbl-64/s_sincos.c @@ -86,16 +86,6 @@ __sincos (double x, double *sinx, double *cosx) return; } - if (k < 0x42F00000) - { - double a, da; - int4 n = reduce_sincos_2 (x, &a, &da); - - *sinx = do_sincos_2 (a, da, x, n, false); - *cosx = do_sincos_2 (a, da, x, n, true); - - return; - } if (k < 0x7ff00000) { reduce_and_compute_sincos (x, sinx, cosx);