From patchwork Tue Jun 26 11:18:23 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
X-Patchwork-Id: 934809
Return-Path: 
 <libc-alpha-return-93629-incoming=patchwork.ozlabs.org@sourceware.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=sourceware.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=libc-alpha-return-93629-incoming=patchwork.ozlabs.org@sourceware.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=arm.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	secure) header.d=sourceware.org header.i=@sourceware.org
	header.b="HJDGLas/";
	dkim=fail reason="signature verification failed" (1024-bit key;
	unprotected) header.d=armh.onmicrosoft.com
	header.i=@armh.onmicrosoft.com header.b="Z1JyI6RL";
	dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 41FNp85mXrz9s1R
	for <incoming@patchwork.ozlabs.org>;
	Tue, 26 Jun 2018 21:18:40 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:from:to:cc:subject:date:message-id
	:content-type:content-transfer-encoding:mime-version; q=dns; s=
	default; b=Ts+0D29EpSMp5G59PneFeaFgnRbrmgn/aQS6tVVJoMECHnaYXWXkL
	5S8+jh+I5x3o/EhiJsL6dSMi7G36aMEGFpU3vl7OwWJtsuKOCPX+C32JuACx/3Oy
	fgRmAWhsVmtQYUcT1YwtXSIlovwjOlK3y4q321OFzPCJ/cIQzbEVnM=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:from:to:cc:subject:date:message-id
	:content-type:content-transfer-encoding:mime-version; s=default;
	bh=ItwvfNRT9plwTEEBkJaKddfoad4=; b=HJDGLas/0vW1zwtEA8iXNS46Iqsu
	1DNvozn53jQGTcSx+MXhc5nQhDUYD0BZeWaemFg46IOKHw4AZu879EL48J5OxkGW
	872GlQw4Fyu1ISmd/UHBvlR+T/LYKBpoMNN/qDElacwIPtwGnkQJTL4aKEDFsT2e
	to6cAth9X6kyh3c=
Received: (qmail 48166 invoked by alias); 26 Jun 2018 11:18:34 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: 
 <mailto:libc-alpha-unsubscribe-incoming=patchwork.ozlabs.org@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 48132 invoked by uid 89); 26 Jun 2018 11:18:34 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-25.1 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	KAM_LOTSOFHASH, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_PASS,
	SPF_PASS autolearn=ham version=3.3.2 spammy=tiny
X-HELO: EUR01-HE1-obe.outbound.protection.outlook.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=armh.onmicrosoft.com;
	s=selector1-arm-com;
	h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
	bh=JQJuWKrg8FDwpGLgcblnYaYEg/H4JnFSwveqNm5leXM=;
	b=Z1JyI6RL0y9cKlTlJMi1kA08DjYTOky/ZvHJIfF9NApNa3/OBHY1dj/3Y8bu4qMEpk4CAP6BLcQjRtNVxS4U9Hu9yDzk723L5f3NOaOWbV+VL5hlDDauTSMprfGM4/dn/oUut14EIZhyxrPHc1e1W4DHr3wMTEhyYYnBzA7JCVM=
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
CC: nd <nd@arm.com>
Subject: [PATCH1/2] Improve performance of sincosf
Date: Tue, 26 Jun 2018 11:18:23 +0000
Message-ID: 
 <DB5PR08MB10301908C58D8A9DD1C847BA83490@DB5PR08MB1030.eurprd08.prod.outlook.com>
authentication-results: spf=none (sender IP is )
	smtp.mailfrom=Wilco.Dijkstra@arm.com;
received-spf: None (protection.outlook.com: arm.com does not designate
	permitted sender hosts)
MIME-Version: 1.0

This patch is a complete rewrite of sinf, cosf and sincosf.  The new version
is significantly faster, as well as simple and accurate. 
The worst-case ULP is 0.56072, maximum relative error is 0.5303p-23 over all
4 billion inputs.  In non-nearest rounding modes the error is 1ULP.

The algorithm uses 3 main cases: small inputs which don't need argument
reduction, small inputs which need a simple range reduction and large inputs
requiring complex range reduction.  The code uses approximate integer
comparisons to quickly decide between these cases - on some targets this may
be slow, so this can be configured to use floating point comparisons.

The small range reducer uses a single reduction step to handle values up to
120.0.  It is fastest on targets which support inlined round instructions.

The large range reducer uses integer arithmetic for simplicity.  It does a
32x96 bit multiply to compute a 64-bit modulo result.  This is more than
accurate enough to handle the worst-case cancellation for values close to
an integer multiple of PI/4.  It could be further optimized, however it is
already much faster than necessary.

sincosf throughput gains on Cortex-A72:
* |x| < 0x1p-12 : 1.6x
* |x| < M_PI_4  : 1.7x
* |x| < 2 * M_PI: 1.5x
* |x| < 120.0   : 1.8x
* |x| < Inf     : 2.3x

On a benchmark with significant use of sincosf the overall speedup is >33%.

ChangeLog:
2018-04-16  Wilco Dijkstra  <wdijkstr@arm.com>

	* math/Makefile: Add s_sincosf_data.c.
	* sysdeps/aarch64/fpu/math_private.h (roundtoint): Use round.
	(converttoint): Use lround.
	* sysdeps/ieee754/flt-32/math_config.h (PREFER_FLOAT_COMPARISON): Define.
	* sysdeps/ieee754/flt-32/s_sincosf.h (abstop12): Add new function.
	(sincosf_poly): Likewise.
	(reduce_small): Likewise.
	(reduce_large): Likewise.
	* sysdeps/ieee754/flt-32/s_sincosf.c (sincosf): Rewrite.
	* sysdeps/ieee754/flt-32/s_sincosf_data.c: New file with sincosf data.
diff --git a/math/Makefile b/math/Makefile
index 7cc0d16b0c83562589f1ae29fe01e45a377a2c97..69180e3c84b14c5815c9321ea1eac46d1515f0a1 100644
--- a/math/Makefile
+++ b/math/Makefile
@@ -130,7 +130,7 @@ type-double-routines := branred doasin dosincos mpa mpatan2	\
 # float support
 type-float-suffix := f
 type-float-routines := k_rem_pio2f math_errf e_exp2f_data e_logf_data	\
-		       e_log2f_data e_powf_log2_data
+		       e_log2f_data e_powf_log2_data s_sincosf_data
 
 # _Float128 support
 type-float128-suffix := f128
diff --git a/sysdeps/aarch64/fpu/math_private.h b/sysdeps/aarch64/fpu/math_private.h
index d9c2d710a95ebe76b25c9dc86ed0bea729a3a6cd..1a8bb66e55124416fa8f554a2e71f0a438abd40e 100644
--- a/sysdeps/aarch64/fpu/math_private.h
+++ b/sysdeps/aarch64/fpu/math_private.h
@@ -21,6 +21,8 @@
 
 #include <fenv.h>
 #include <fpu_control.h>
+#include <stdint.h>
+#include <math.h>
 
 #define math_opt_barrier(x) \
 ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
@@ -303,26 +305,20 @@ libc_feresetround_noex_aarch64_ctx (struct rm_ctx *ctx)
 #define libc_feresetround_noexf_ctx	libc_feresetround_noex_aarch64_ctx
 #define libc_feresetround_noexl_ctx	libc_feresetround_noex_aarch64_ctx
 
-/* Hack: only include the large arm_neon.h when needed.  */
-#ifdef _MATH_CONFIG_H
-# include <arm_neon.h>
-
 /* ACLE intrinsics for frintn and fcvtns instructions.  */
 # define TOINT_INTRINSICS 1
 
 static inline double_t
 roundtoint (double_t x)
 {
-  return vget_lane_f64 (vrndn_f64 (vld1_f64 (&x)), 0);
+  return round (x);
 }
 
 static inline uint64_t
 converttoint (double_t x)
 {
-  return vcvtnd_s64_f64 (x);
+  return (uint64_t) lround (x);
 }
 #endif
 
 #include_next <math_private.h>
-
-#endif
diff --git a/sysdeps/ieee754/flt-32/math_config.h b/sysdeps/ieee754/flt-32/math_config.h
index c4def9bbc1898d71865572d0084b219034bf3805..b68fc47acc811d9e9844faac1ca71ec3b4d7de41 100644
--- a/sysdeps/ieee754/flt-32/math_config.h
+++ b/sysdeps/ieee754/flt-32/math_config.h
@@ -46,6 +46,9 @@
 #ifndef TOINT_SHIFT
 # define TOINT_SHIFT 1
 #endif
+#ifndef PREFER_FLOAT_COMPARISON
+#define PREFER_FLOAT_COMPARISON 0
+#endif
 
 static inline uint32_t
 asuint (float f)
diff --git a/sysdeps/ieee754/flt-32/s_sincosf.h b/sysdeps/ieee754/flt-32/s_sincosf.h
index 35b5eee536ef0ca55e68e3d3a091486dca39d36a..9da17dc7fe3e1d494164edbf26001303343e7c34 100644
--- a/sysdeps/ieee754/flt-32/s_sincosf.h
+++ b/sysdeps/ieee754/flt-32/s_sincosf.h
@@ -16,6 +16,10 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
+#include <stdint.h>
+#include <math.h>
+#include "math_config.h"
+
 /* Chebyshev constants for cos, range -PI/4 - PI/4.  */
 static const double C0 = -0x1.ffffffffe98aep-2;
 static const double C1 =  0x1.55555545c50c7p-5;
@@ -153,3 +157,118 @@ reduced_cos (double theta, unsigned int n)
     }
   return sign * cx;
 }
+
+/* New sincosf.  */
+
+/* PI * 2^-64.  */
+static const double pi64 = 0x1.921FB54442D18p-62;
+/* PI / 4.  */
+static const double pio4 = 0x1.921FB54442D18p-1;
+
+typedef const struct
+{
+  double sign[4];
+  double hpi_inv, hpi, c0, c1, c2, c3, c4, s1, s2, s3;
+} sincos_t;
+
+extern sincos_t sincosf_table[2] attribute_hidden;
+
+extern const uint32_t inv_pio4[] attribute_hidden;
+
+/* abstop12 assumes floating point reinterpret is fast by default.
+   If floating point comparisons are faster, define PREFER_FLOAT_COMPARISON.  */
+#if PREFER_FLOAT_COMPARISON
+static inline float
+abstop12 (float x)
+{
+  return fabsf (x);
+}
+#else
+static inline uint32_t
+abstop12 (float x)
+{
+  return (asuint (x) >> 20) & 0x7ff;
+}
+#endif
+
+/* Compute the sine and cosine of inputs X and X2 (X squared), using the
+   polynomial P and store the results in SINP and COSP.  N is the quadrant,
+   if odd the cosine and sine polynomials are swapped.  */
+static inline void
+sincosf_poly (double x, double x2, sincos_t *p, int n, float *sinp, float *cosp)
+{
+  double x3, x4, x5, x6, s, c, c1, c2, s1;
+
+  x4 = x2 * x2;
+  x3 = x2 * x;
+  c2 = p->c3 + x2 * p->c4;
+  s1 = p->s2 + x2 * p->s3;
+
+  /* Swap sin/cos result based on quadrant.  */
+  float *tmp = (n & 1 ? cosp : sinp);
+  cosp = (n & 1 ? sinp : cosp);
+  sinp = tmp;
+
+  c1 = p->c0 + x2 * p->c1;
+  x5 = x3 * x2;
+  x6 = x4 * x2;
+
+  s = x + x3 * p->s1;
+  c = c1 + x4 * p->c2;
+
+  *sinp = s + x5 * s1;
+  *cosp = c + x6 * c2;
+}
+
+/* Fast range reduction using single multiply-subtract.  Return the modulo of
+   X as a value between -PI/4 and PI/4 and store the quadrant in NP.
+   The values for PI/2 and 2/PI are accessed via P.  Since PI/2 as a double
+   is accurate to 55 bits and the worst-case cancellation happens at 6 * PI/4,
+   only 2 multiplies are required and the result is accurate for |X| <= 120.0.
+   Use round/lround if inlined, otherwise convert to int.  To avoid inaccuracies
+   introduced by truncating negative values, compute the quadrant * 2^24.  */
+static inline double
+reduce_fast (double x, sincos_t *p, int *np)
+{
+  double r;
+#if TOINT_INTRINSICS
+  r = x * p->hpi_inv;
+  *np = converttoint (r);
+  return x - roundtoint (r) * p->hpi;
+#else
+  r = x * p->hpi_inv;
+  int n = ((int32_t)r + 0x800000) >> 24;
+  *np = n;
+  return x - n * p->hpi;
+#endif
+}
+
+/* Reduce the range of XI to a multiple of PI/4 using fast integer arithmetic.
+   XI is a reinterpreted float and must be >= 2.0f (the sign bit is ignored).
+   Return the modulo between -PI/4 and PI/4 and store the quadrant in NP.
+   Reduction uses a table of 4/PI with 192 bits of precision.  A 32x96->128 bit
+   multiply computes the exact 2.62-bit fixed-point modulo.  Since the result
+   can have at most 29 leading zeros after the binary point, the double
+   precision result is accurate to 33 bits.  */
+static inline double
+reduce_large (uint32_t xi, int *np)
+{
+  const uint32_t *arr = &inv_pio4[(xi >> 26) & 15];
+  int shift = (xi >> 23) & 7;
+  uint64_t n, res0, res1, res2;
+
+  xi = (xi & 0xffffff) | 0x800000;
+  xi <<= shift;
+
+  res0 = xi * arr[0];
+  res1 = (uint64_t)xi * arr[4];
+  res2 = (uint64_t)xi * arr[8];
+  res0 = (res2 >> 32) | (res0 << 32);
+  res0 += res1;
+
+  n = (res0 + (1ULL << 61)) >> 62;
+  res0 -= n << 62;
+  double x = (int64_t)res0;
+  *np = n;
+  return x * pi64;
+}
diff --git a/sysdeps/ieee754/flt-32/s_sincosf.c b/sysdeps/ieee754/flt-32/s_sincosf.c
index d4a5a1b22c2f9a5cf607cd2d31157dd1dbd14e9a..629673cbd37ec863c96aaebf2393e0722c2aad9c 100644
--- a/sysdeps/ieee754/flt-32/s_sincosf.c
+++ b/sysdeps/ieee754/flt-32/s_sincosf.c
@@ -1,5 +1,5 @@
 /* Compute sine and cosine of argument.
-   Copyright (C) 2017-2018 Free Software Foundation, Inc.
+   Copyright (C) 2018 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -17,9 +17,11 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <errno.h>
+#include <stdint.h>
 #include <math.h>
 #include <math_private.h>
 #include <libm-alias-float.h>
+#include "math_config.h"
 #include "s_sincosf.h"
 
 #ifndef SINCOSF
@@ -28,141 +30,72 @@
 # define SINCOSF_FUNC SINCOSF
 #endif
 
+/* Fast sincosf implementation.  Worst-case ULP is 0.56072, maximum relative
+   error is 0.5303p-23.  A single-step signed range reduction is used for
+   small values.  Large inputs have their range reduced using fast integer
+   arithmetic.
+*/
 void
-SINCOSF_FUNC (float x, float *sinx, float *cosx)
+SINCOSF_FUNC (float y, float *sinp, float *cosp)
 {
-  double cx;
-  double theta = x;
-  double abstheta = fabs (theta);
-  /* If |x|< Pi/4.  */
-  if (isless (abstheta, M_PI_4))
+  double x = y;
+  double s;
+  int n;
+  sincos_t *p = &sincosf_table[0];
+
+  if (abstop12 (y) < abstop12 (pio4))
+    {
+      double x2 = x * x;
+
+      if (__glibc_unlikely (abstop12 (y) < abstop12 (0x1p-12f)))
+      {
+	/* Force underflow for tiny y.  */
+	if (__glibc_unlikely (abstop12 (y) < abstop12 (0x1p-126f)))
+	  math_force_eval ((float)x2);
+	*sinp = y;
+	*cosp = 1.0f;
+	return;
+      }
+
+      sincosf_poly (x, x2, p, 0, sinp, cosp);
+    }
+  else if (abstop12 (y) < abstop12 (120.0f))
     {
-      if (abstheta >= 0x1p-5) /* |x| >= 2^-5.  */
-	{
-	  const double theta2 = theta * theta;
-	  /* Chebyshev polynomial of the form for sin and cos.  */
-	  cx = C3 + theta2 * C4;
-	  cx = C2 + theta2 * cx;
-	  cx = C1 + theta2 * cx;
-	  cx = C0 + theta2 * cx;
-	  cx = 1.0 + theta2 * cx;
-	  *cosx = cx;
-	  cx = S3 + theta2 * S4;
-	  cx = S2 + theta2 * cx;
-	  cx = S1 + theta2 * cx;
-	  cx = S0 + theta2 * cx;
-	  cx = theta + theta * theta2 * cx;
-	  *sinx = cx;
-	}
-      else if (abstheta >= 0x1p-27)     /* |x| >= 2^-27.  */
-	{
-	  /* A simpler Chebyshev approximation is close enough for this range:
-	     for sin: x+x^3*(SS0+x^2*SS1)
-	     for cos: 1.0+x^2*(CC0+x^3*CC1).  */
-	  const double theta2 = theta * theta;
-	  cx = CC0 + theta * theta2 * CC1;
-	  cx = 1.0 + theta2 * cx;
-	  *cosx = cx;
-	  cx = SS0 + theta2 * SS1;
-	  cx = theta + theta * theta2 * cx;
-	  *sinx = cx;
-	}
-      else
-	{
-	  /* Handle some special cases.  */
-	  if (theta)
-	    *sinx = theta - (theta * SMALL);
-	  else
-	    *sinx = theta;
-	  *cosx = 1.0 - abstheta;
-	}
+      x = reduce_fast (x, p, &n);
+
+      /* Setup the signs for sin and cos.  */
+      s = p->sign[n & 3];
+
+      if (n & 2)
+	p = &sincosf_table[1];
+
+      sincosf_poly (x * s, x * x, p, n, sinp, cosp);
     }
-  else                          /* |x| >= Pi/4.  */
+  else if (__glibc_likely (abstop12 (y) < abstop12 (INFINITY)))
     {
-      unsigned int signbit = isless (x, 0);
-      if (isless (abstheta, 9 * M_PI_4))        /* |x| < 9*Pi/4.  */
-	{
-	  /* There are cases where FE_UPWARD rounding mode can
-	     produce a result of abstheta * inv_PI_4 == 9,
-	     where abstheta < 9pi/4, so the domain for
-	     pio2_table must go to 5 (9 / 2 + 1).  */
-	  unsigned int n = (abstheta * inv_PI_4) + 1;
-	  theta = abstheta - pio2_table[n / 2];
-	  *sinx = reduced_sin (theta, n, signbit);
-	  *cosx = reduced_cos (theta, n);
-	}
-      else if (isless (abstheta, INFINITY))
-	{
-	  if (abstheta < 0x1p+23)     /* |x| < 2^23.  */
-	    {
-	      unsigned int n = ((unsigned int) (abstheta * inv_PI_4)) + 1;
-	      double x = n / 2;
-	      theta = (abstheta - x * PI_2_hi) - x * PI_2_lo;
-	      /* Argument reduction needed.  */
-	      *sinx = reduced_sin (theta, n, signbit);
-	      *cosx = reduced_cos (theta, n);
-	    }
-	  else                  /* |x| >= 2^23.  */
-	    {
-	      x = fabsf (x);
-	      int exponent;
-	      GET_FLOAT_WORD (exponent, x);
-	      exponent
-	        = (exponent >> FLOAT_EXPONENT_SHIFT) - FLOAT_EXPONENT_BIAS;
-	      exponent += 3;
-	      exponent /= 28;
-	      double a = invpio4_table[exponent] * x;
-	      double b = invpio4_table[exponent + 1] * x;
-	      double c = invpio4_table[exponent + 2] * x;
-	      double d = invpio4_table[exponent + 3] * x;
-	      uint64_t l = a;
-	      l &= ~0x7;
-	      a -= l;
-	      double e = a + b;
-	      l = e;
-	      e = a - l;
-	      if (l & 1)
-	        {
-	          e -= 1.0;
-	          e += b;
-	          e += c;
-	          e += d;
-	          e *= M_PI_4;
-		  *sinx = reduced_sin (e, l + 1, signbit);
-		  *cosx = reduced_cos (e, l + 1);
-	        }
-	      else
-		{
-		  e += b;
-		  e += c;
-		  e += d;
-		  if (e <= 1.0)
-		    {
-		      e *= M_PI_4;
-		      *sinx = reduced_sin (e, l + 1, signbit);
-		      *cosx = reduced_cos (e, l + 1);
-		    }
-		  else
-		    {
-		      l++;
-		      e -= 2.0;
-		      e *= M_PI_4;
-		      *sinx = reduced_sin (e, l + 1, signbit);
-		      *cosx = reduced_cos (e, l + 1);
-		    }
-		}
-	    }
-	}
-      else
-	{
-	  int32_t ix;
-	  /* High word of x.  */
-	  GET_FLOAT_WORD (ix, abstheta);
-	  /* sin/cos(Inf or NaN) is NaN.  */
-	  *sinx = *cosx = x - x;
-	  if (ix == 0x7f800000)
-	    __set_errno (EDOM);
-	}
+      uint32_t xi = asuint (y);
+      int sign = xi >> 31;
+
+      x = reduce_large (xi, &n);
+
+      /* Setup signs for sin and cos - include original sign.  */
+      s = p->sign[(n + sign) & 3];
+
+      if ((n + sign) & 2)
+	p = &sincosf_table[1];
+
+      sincosf_poly (x * s, x * x, p, n, sinp, cosp);
+    }
+  else
+    {
+      /* Return NaN if Inf or NaN for both sin and cos.  */
+      *sinp = *cosp = y - y;
+#if WANT_ERRNO
+      /* Needed to set errno for +-Inf, the add is a hack to work
+	 around a gcc register allocation issue: just passing y
+	 affects code generation in the fast path.  */
+      __math_invalidf (y + y);
+#endif
     }
 }
 
diff --git a/sysdeps/ieee754/flt-32/s_sincosf_data.c b/sysdeps/ieee754/flt-32/s_sincosf_data.c
new file mode 100644
index 0000000000000000000000000000000000000000..02eacf29ddf55ea93aba250a45e0569303239b3a
--- /dev/null
+++ b/sysdeps/ieee754/flt-32/s_sincosf_data.c
@@ -0,0 +1,74 @@
+/* Compute sine and cosine of argument.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+#include <math.h>
+#include "math_config.h"
+#include "s_sincosf.h"
+
+/* The constants and polynomials for sine and cosine.  The 2nd entry
+   computes -cos (x) rather than cos (x) to get negation for free.  */
+sincos_t sincosf_table[2] =
+{
+  {
+    { 1.0, -1.0, -1.0, 1.0 },
+#if TOINT_INTRINSICS
+    0x1.45F306DC9C883p-1,
+#else
+    0x1.45F306DC9C883p+23,
+#endif
+    0x1.921FB54442D18p0,
+    0x1p0,
+    -0x1.ffffffd0c621cp-2,
+    0x1.55553e1068f19p-5,
+    -0x1.6c087e89a359dp-10,
+    0x1.99343027bf8c3p-16,
+    -0x1.555545995a603p-3,
+    0x1.1107605230bc4p-7,
+    -0x1.994eb3774cf24p-13
+  },
+  {
+    { 1.0, -1.0, -1.0, 1.0 },
+#if TOINT_INTRINSICS
+    0x1.45F306DC9C883p-1,
+#else
+    0x1.45F306DC9C883p+23,
+#endif
+    0x1.921FB54442D18p0,
+    -0x1p0,
+    0x1.ffffffd0c621cp-2,
+    -0x1.55553e1068f19p-5,
+    0x1.6c087e89a359dp-10,
+    -0x1.99343027bf8c3p-16,
+    -0x1.555545995a603p-3,
+    0x1.1107605230bc4p-7,
+    -0x1.994eb3774cf24p-13
+  }
+};
+
+/* Table with 4/PI to 192 bit precision.  To avoid unaligned accesses
+   only 8 new bits are added per entry, making the table 4 times larger.  */
+const uint32_t inv_pio4[24] =
+{
+  0xa2,       0xa2f9,	  0xa2f983,   0xa2f9836e,
+  0xf9836e4e, 0x836e4e44, 0x6e4e4415, 0x4e441529,
+  0x441529fc, 0x1529fc27, 0x29fc2757, 0xfc2757d1,
+  0x2757d1f5, 0x57d1f534, 0xd1f534dd, 0xf534ddc0,
+  0x34ddc0db, 0xddc0db62, 0xc0db6295, 0xdb629599,
+  0x6295993c, 0x95993c43, 0x993c4390, 0x3c439041
+};

From patchwork Tue Jun 26 11:20:43 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
X-Patchwork-Id: 934810
Return-Path: 
 <libc-alpha-return-93630-incoming=patchwork.ozlabs.org@sourceware.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=sourceware.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=libc-alpha-return-93630-incoming=patchwork.ozlabs.org@sourceware.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=arm.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	secure) header.d=sourceware.org header.i=@sourceware.org
	header.b="X7GjLPHN";
	dkim=fail reason="signature verification failed" (1024-bit key;
	unprotected) header.d=armh.onmicrosoft.com
	header.i=@armh.onmicrosoft.com header.b="WSFIKaul";
	dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 41FNrr0Lmcz9ryk
	for <incoming@patchwork.ozlabs.org>;
	Tue, 26 Jun 2018 21:20:59 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:from:to:cc:subject:date:message-id
	:content-type:content-transfer-encoding:mime-version; q=dns; s=
	default; b=KKy0InYo7q0Hn9ERycBDNjhQpRAcFS+J4/bqE9WXGNr+xXGMb/qVC
	TFs5sG1+Ihe4jJWhzQ+l+mmPb19nhzO+z3DEk8/jHCQF5IZfE6QMjlSJAi9S5qOK
	TyuyqOhCqL1V8BOZTqOoEwMAm1k85kXhysgFJyCCrQjF8V9u9GKppY=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:from:to:cc:subject:date:message-id
	:content-type:content-transfer-encoding:mime-version; s=default;
	bh=dqpavVYKWZ9i9LvuyPdkKfYqtOM=; b=X7GjLPHNjSgAeQ3B0y5+9hXTVMDa
	SHbRJus0Vn54N0dY5bkgVjWm1DLEovqiZ6oxFQovC8a1o7RpLVRtwWu8hQAk0oga
	vG2EwX1n7ec6c7RNmcGczOE9mvcHqTO4wSwRdi2rUkt2Mzbn9Dndi1qI38frc68U
	m65XF6CpzGfQurM=
Received: (qmail 30402 invoked by alias); 26 Jun 2018 11:20:53 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: 
 <mailto:libc-alpha-unsubscribe-incoming=patchwork.ozlabs.org@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 30383 invoked by uid 89); 26 Jun 2018 11:20:51 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-25.1 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	KAM_LOTSOFHASH, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_PASS,
	SPF_PASS autolearn=ham version=3.3.2 spammy=
X-HELO: EUR03-DB5-obe.outbound.protection.outlook.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=armh.onmicrosoft.com;
	s=selector1-arm-com;
	h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
	bh=kAFdsmjrVtd9sIRii0hwwcaFX2QDD3ewDFl9688dQ5E=;
	b=WSFIKaulSzhqvvWF3RbcXj4y0Tz8s9ZiJ512e4l+ahMcoQjQkF9wPhjctmSKZiLj7pN2X9ZZu+GzYVJOU1dJdVKiwuNI16kuLUUNEmKC5oY5VnCPiMwZto4UoAZQJI4ZyrY4S35L6gRDMr15Q4E/NEpkhbbLDZqZnlXalb/QZ9M=
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
CC: nd <nd@arm.com>
Subject: [PATCH2/2] Improve performance of sinf and cosf
Date: Tue, 26 Jun 2018 11:20:43 +0000
Message-ID: 
 <DB5PR08MB1030B80CCD5EC98114DFF97883490@DB5PR08MB1030.eurprd08.prod.outlook.com>
authentication-results: spf=none (sender IP is )
	smtp.mailfrom=Wilco.Dijkstra@arm.com;
received-spf: None (protection.outlook.com: arm.com does not designate
	permitted sender hosts)
MIME-Version: 1.0

The second patch improves performance of sinf and cosf using the same
algorithms and polynomials.  The returned values are identical to sincosf
for the same input.  ULP definitions for AArch64 and x64 are updated.

sinf/cosf througput gains on Cortex-A72:
* |x| < 0x1p-12 : 1.2x
* |x| < M_PI_4  : 1.8x
* |x| < 2 * M_PI: 1.7x
* |x| < 120.0   : 2.3x
* |x| < Inf     : 3.0x

ChangeLog:
2018-04-16  Wilco Dijkstra  <wdijkstr@arm.com>

	* sysdeps/aarch64/libm-test-ulps: Update ULP for sinf, cosf, sincosf.
	* sysdeps/x86_64/fpu/libm-test-ulps: Update ULP for sinf and cosf.
	* sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c: Add definitions of
	constants rather than including generic sincosf.h.
	* sysdeps/ieee754/flt-32/s_cosf.c (cosf): Rewrite.
	* sysdeps/ieee754/flt-32/s_sincosf.h (reduced_sin): Remove.
	(reduced_cos): Remove.
	(sinf_poly): New function.
	* sysdeps/ieee754/flt-32/s_sinf.c (sinf): Rewrite.

diff --git a/sysdeps/aarch64/libm-test-ulps b/sysdeps/aarch64/libm-test-ulps
index be06085154db24c8fd6cf1bce417028a959aaa27..6a890284597f545db0d56350795bec0f042148e5 100644
--- a/sysdeps/aarch64/libm-test-ulps
+++ b/sysdeps/aarch64/libm-test-ulps
@@ -1021,9 +1021,9 @@ ldouble: 1
 
 Function: "cos_downward":
 double: 1
-float: 2
+float: 1
 idouble: 1
-ifloat: 2
+ifloat: 1
 ildouble: 3
 ldouble: 3
 
@@ -1037,9 +1037,9 @@ ldouble: 1
 
 Function: "cos_upward":
 double: 1
-float: 2
+float: 1
 idouble: 1
-ifloat: 2
+ifloat: 1
 ildouble: 2
 ldouble: 2
 
@@ -1981,9 +1981,9 @@ ldouble: 1
 
 Function: "sin_downward":
 double: 1
-float: 2
+float: 1
 idouble: 1
-ifloat: 2
+ifloat: 1
 ildouble: 3
 ldouble: 3
 
@@ -1997,9 +1997,9 @@ ldouble: 2
 
 Function: "sin_upward":
 double: 1
-float: 2
+float: 1
 idouble: 1
-ifloat: 2
+ifloat: 1
 ildouble: 3
 ldouble: 3
 
@@ -2013,9 +2013,9 @@ ldouble: 1
 
 Function: "sincos_downward":
 double: 1
-float: 2
+float: 1
 idouble: 1
-ifloat: 2
+ifloat: 1
 ildouble: 3
 ldouble: 3
 
@@ -2029,9 +2029,9 @@ ldouble: 2
 
 Function: "sincos_upward":
 double: 1
-float: 2
+float: 1
 idouble: 1
-ifloat: 2
+ifloat: 1
 ildouble: 3
 ldouble: 3
 
diff --git a/sysdeps/ieee754/flt-32/s_cosf.c b/sysdeps/ieee754/flt-32/s_cosf.c
index 061264d2596abaccf576eff8efc7b7180822c581..f0705cf0887ecdf5d3be32b63e2b4b48bc7d2c39 100644
--- a/sysdeps/ieee754/flt-32/s_cosf.c
+++ b/sysdeps/ieee754/flt-32/s_cosf.c
@@ -1,5 +1,5 @@
 /* Compute cosine of argument.
-   Copyright (C) 2017-2018 Free Software Foundation, Inc.
+   Copyright (C) 2018 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -16,10 +16,11 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
-#include <errno.h>
+#include <stdint.h>
 #include <math.h>
 #include <math_private.h>
 #include <libm-alias-float.h>
+#include "math_config.h"
 #include "s_sincosf.h"
 
 #ifndef COSF
@@ -28,121 +29,57 @@
 # define COSF_FUNC COSF
 #endif
 
+/* Fast cosf implementation.  Worst-case ULP is 0.56072, maximum relative
+   error is 0.5303p-23.  A single-step signed range reduction is used for
+   small values.  Large inputs have their range reduced using fast integer
+   arithmetic.
+*/
 float
-COSF_FUNC (float x)
+COSF_FUNC (float y)
 {
-  double theta = x;
-  double abstheta = fabs (theta);
-  if (isless (abstheta, M_PI_4))
+  double x = y;
+  double s;
+  int n;
+  sincos_t *p = &sincosf_table[0];
+
+  if (abstop12 (y) < abstop12 (pio4))
+    {
+      double x2 = x * x;
+
+      if (__glibc_unlikely (abstop12 (y) < abstop12 (0x1p-12f)))
+	return 1.0f;
+
+      return sinf_poly (x, x2, p, 1);
+    }
+  else if (__glibc_likely (abstop12 (y) < abstop12 (120.0f)))
     {
-      double cx;
-      if (abstheta >= 0x1p-5)
-	{
-	  const double theta2 = theta * theta;
-	  /* Chebyshev polynomial of the form for cos:
-	   * 1 + x^2 (C0 + x^2 (C1 + x^2 (C2 + x^2 (C3 + x^2 * C4)))).  */
-	  cx = C3 + theta2 * C4;
-	  cx = C2 + theta2 * cx;
-	  cx = C1 + theta2 * cx;
-	  cx = C0 + theta2 * cx;
-	  cx = 1. + theta2 * cx;
-	  return cx;
-	}
-      else if (abstheta >= 0x1p-27)
-	{
-	  /* A simpler Chebyshev approximation is close enough for this range:
-	   * 1 + x^2 (CC0 + x^3 * CC1).  */
-	  const double theta2 = theta * theta;
-	  cx = CC0 + theta * theta2 * CC1;
-	  cx = 1.0 + theta2 * cx;
-	  return cx;
-	}
-      else
-	{
-	  /* For small enough |theta|, this is close enough.  */
-	  return 1.0 - abstheta;
-	}
+      x = reduce_fast (x, p, &n);
+
+      /* Setup the signs for sin and cos.  */
+      s = p->sign[n & 3];
+
+      if (n & 2)
+	p = &sincosf_table[1];
+
+      return sinf_poly (x * s, x * x, p, n ^ 1);
     }
-  else /* |theta| >= Pi/4.  */
+  else if (abstop12 (y) < abstop12 (INFINITY))
     {
-      if (isless (abstheta, 9 * M_PI_4))
-	{
-	  /* There are cases where FE_UPWARD rounding mode can
-	     produce a result of abstheta * inv_PI_4 == 9,
-	     where abstheta < 9pi/4, so the domain for
-	     pio2_table must go to 5 (9 / 2 + 1).  */
-	  unsigned int n = (abstheta * inv_PI_4) + 1;
-	  theta = abstheta - pio2_table[n / 2];
-	  return reduced_cos (theta, n);
-	}
-      else if (isless (abstheta, INFINITY))
-	{
-	  if (abstheta < 0x1p+23)
-	    {
-	      unsigned int n = ((unsigned int) (abstheta * inv_PI_4)) + 1;
-	      double x = n / 2;
-	      theta = (abstheta - x * PI_2_hi) - x * PI_2_lo;
-	      /* Argument reduction needed.  */
-	      return reduced_cos (theta, n);
-	    }
-	  else /* |theta| >= 2^23.  */
-	    {
-	      x = fabsf (x);
-	      int exponent;
-	      GET_FLOAT_WORD (exponent, x);
-	      exponent = (exponent >> FLOAT_EXPONENT_SHIFT)
-			 - FLOAT_EXPONENT_BIAS;
-	      exponent += 3;
-	      exponent /= 28;
-	      double a = invpio4_table[exponent] * x;
-	      double b = invpio4_table[exponent + 1] * x;
-	      double c = invpio4_table[exponent + 2] * x;
-	      double d = invpio4_table[exponent + 3] * x;
-	      uint64_t l = a;
-	      l &= ~0x7;
-	      a -= l;
-	      double e = a + b;
-	      l = e;
-	      e = a - l;
-	      if (l & 1)
-		{
-		  e -= 1.0;
-		  e += b;
-		  e += c;
-		  e += d;
-		  e *= M_PI_4;
-		  return reduced_cos (e, l + 1);
-		}
-	      else
-		{
-		  e += b;
-		  e += c;
-		  e += d;
-		  if (e <= 1.0)
-		    {
-		      e *= M_PI_4;
-		      return reduced_cos (e, l + 1);
-		    }
-		  else
-		    {
-		      l++;
-		      e -= 2.0;
-		      e *= M_PI_4;
-		      return reduced_cos (e, l + 1);
-		    }
-		}
-	    }
-	}
-      else
-	{
-	  int32_t ix;
-	  GET_FLOAT_WORD (ix, abstheta);
-	  /* cos(Inf or NaN) is NaN.  */
-	  if (ix == 0x7f800000) /* Inf.  */
-	    __set_errno (EDOM);
-	  return x - x;
-	}
+      uint32_t xi = asuint (y);
+      int sign = xi >> 31;
+
+      x = reduce_large (xi, &n);
+
+      /* Setup signs for sin and cos - include original sign.  */
+      s = p->sign[(n + sign) & 3];
+
+      if ((n + sign) & 2)
+	p = &sincosf_table[1];
+
+      return sinf_poly (x * s, x * x, p, n ^ 1);
     }
+  else
+    return __math_invalidf (y);
 }
 
 #ifndef COSF
diff --git a/sysdeps/ieee754/flt-32/s_sincosf.h b/sysdeps/ieee754/flt-32/s_sincosf.h
index e12e26867809f0f35840cfbb810218c985924d26..f874592ecc77c00c43f04661a37132d037352890 100644
--- a/sysdeps/ieee754/flt-32/s_sincosf.h
+++ b/sysdeps/ieee754/flt-32/s_sincosf.h
@@ -1,5 +1,5 @@
 /* Used by sinf, cosf and sincosf functions.
-   Copyright (C) 2017-2018 Free Software Foundation, Inc.
+   Copyright (C) 2018 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -20,146 +20,6 @@
 #include <math.h>
 #include "math_config.h"
 
-/* Chebyshev constants for cos, range -PI/4 - PI/4.  */
-static const double C0 = -0x1.ffffffffe98aep-2;
-static const double C1 =  0x1.55555545c50c7p-5;
-static const double C2 = -0x1.6c16b348b6874p-10;
-static const double C3 =  0x1.a00eb9ac43ccp-16;
-static const double C4 = -0x1.23c97dd8844d7p-22;
-
-/* Chebyshev constants for sin, range -PI/4 - PI/4.  */
-static const double S0 = -0x1.5555555551cd9p-3;
-static const double S1 =  0x1.1111110c2688bp-7;
-static const double S2 = -0x1.a019f8b4bd1f9p-13;
-static const double S3 =  0x1.71d7264e6b5b4p-19;
-static const double S4 = -0x1.a947e1674b58ap-26;
-
-/* Chebyshev constants for sin, range 2^-27 - 2^-5.  */
-static const double SS0 = -0x1.555555543d49dp-3;
-static const double SS1 =  0x1.110f475cec8c5p-7;
-
-/* Chebyshev constants for cos, range 2^-27 - 2^-5.  */
-static const double CC0 = -0x1.fffffff5cc6fdp-2;
-static const double CC1 =  0x1.55514b178dac5p-5;
-
-/* PI/2 with 98 bits of accuracy.  */
-static const double PI_2_hi = 0x1.921fb544p+0;
-static const double PI_2_lo = 0x1.0b4611a626332p-34;
-
-static const double SMALL = 0x1p-50; /* 2^-50.  */
-static const double inv_PI_4 = 0x1.45f306dc9c883p+0; /* 4/PI.  */
-
-#define FLOAT_EXPONENT_SHIFT 23
-#define FLOAT_EXPONENT_BIAS 127
-
-static const double pio2_table[] = {
-  0 * M_PI_2,
-  1 * M_PI_2,
-  2 * M_PI_2,
-  3 * M_PI_2,
-  4 * M_PI_2,
-  5 * M_PI_2
-};
-
-static const double invpio4_table[] = {
-  0x0p+0,
-  0x1.45f306cp+0,
-  0x1.c9c882ap-28,
-  0x1.4fe13a8p-58,
-  0x1.f47d4dp-85,
-  0x1.bb81b6cp-112,
-  0x1.4acc9ep-142,
-  0x1.0e4107cp-169
-};
-
-static const double ones[] = { 1.0, -1.0 };
-
-/* Compute the sine value using Chebyshev polynomials where
-   THETA is the range reduced absolute value of the input
-   and it is less than Pi/4,
-   N is calculated as trunc(|x|/(Pi/4)) + 1 and it is used to decide
-   whether a sine or cosine approximation is more accurate and
-   SIGNBIT is used to add the correct sign after the Chebyshev
-   polynomial is computed.  */
-static inline float
-reduced_sin (const double theta, const unsigned int n,
-	 const unsigned int signbit)
-{
-  double sx;
-  const double theta2 = theta * theta;
-  /* We are operating on |x|, so we need to add back the original
-     signbit for sinf.  */
-  double sign;
-  /* Determine positive or negative primary interval.  */
-  sign = ones[((n >> 2) & 1) ^ signbit];
-  /* Are we in the primary interval of sin or cos?  */
-  if ((n & 2) == 0)
-    {
-      /* Here sinf() is calculated using sin Chebyshev polynomial:
-	x+x^3*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))).  */
-      sx = S3 + theta2 * S4;     /* S3+x^2*S4.  */
-      sx = S2 + theta2 * sx;     /* S2+x^2*(S3+x^2*S4).  */
-      sx = S1 + theta2 * sx;     /* S1+x^2*(S2+x^2*(S3+x^2*S4)).  */
-      sx = S0 + theta2 * sx;     /* S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4))).  */
-      sx = theta + theta * theta2 * sx;
-    }
-  else
-    {
-     /* Here sinf() is calculated using cos Chebyshev polynomial:
-	1.0+x^2*(C0+x^2*(C1+x^2*(C2+x^2*(C3+x^2*C4)))).  */
-      sx = C3 + theta2 * C4;     /* C3+x^2*C4.  */
-      sx = C2 + theta2 * sx;     /* C2+x^2*(C3+x^2*C4).  */
-      sx = C1 + theta2 * sx;     /* C1+x^2*(C2+x^2*(C3+x^2*C4)).  */
-      sx = C0 + theta2 * sx;     /* C0+x^2*(C1+x^2*(C2+x^2*(C3+x^2*C4))).  */
-      sx = 1.0 + theta2 * sx;
-    }
-
-  /* Add in the signbit and assign the result.  */
-  return sign * sx;
-}
-
-/* Compute the cosine value using Chebyshev polynomials where
-   THETA is the range reduced absolute value of the input
-   and it is less than Pi/4,
-   N is calculated as trunc(|x|/(Pi/4)) + 1 and it is used to decide
-   whether a sine or cosine approximation is more accurate and
-   the sign of the result.  */
-static inline float
-reduced_cos (double theta, unsigned int n)
-{
-  double sign, cx;
-  const double theta2 = theta * theta;
-
-  /* Determine positive or negative primary interval.  */
-  n += 2;
-  sign = ones[(n >> 2) & 1];
-
-  /* Are we in the primary interval of sin or cos?  */
-  if ((n & 2) == 0)
-    {
-      /* Here cosf() is calculated using sin Chebyshev polynomial:
-	x+x^3*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))).  */
-      cx = S3 + theta2 * S4;
-      cx = S2 + theta2 * cx;
-      cx = S1 + theta2 * cx;
-      cx = S0 + theta2 * cx;
-      cx = theta + theta * theta2 * cx;
-    }
-  else
-    {
-     /* Here cosf() is calculated using cos Chebyshev polynomial:
-	1.0+x^2*(C0+x^2*(C1+x^2*(C2+x^2*(C3+x^2*C4)))).  */
-      cx = C3 + theta2 * C4;
-      cx = C2 + theta2 * cx;
-      cx = C1 + theta2 * cx;
-      cx = C0 + theta2 * cx;
-      cx = 1. + theta2 * cx;
-    }
-  return sign * cx;
-}
-
-/* New sincosf.  */
-
 /* PI * 2^-64.  */
 static const double pi64 = 0x1.921FB54442D18p-62;
 /* PI / 4.  */
@@ -220,6 +80,36 @@ sincosf_poly (double x, double x2, sincos_t *p, int n, float *sinp, float *cosp)
   *cosp = c + x6 * c2;
 }
 
+/* Return the sine of inputs X and X2 (X squared) using the polynomial P.
+   N is the quadrant, and if odd the cosine polynomial is used.  */
+static inline float
+sinf_poly (double x, double x2, sincos_t *p, int n)
+{
+  double x3, x4, x6, x7, s, c, c1, c2, s1;
+
+  if ((n & 1) == 0)
+    {
+      x3 = x * x2;
+      s1 = p->s2 + x2 * p->s3;
+
+      x7 = x3 * x2;
+      s = x + x3 * p->s1;
+
+      return s + x7 * s1;
+    }
+  else
+    {
+      x4 = x2 * x2;
+      c2 = p->c3 + x2 * p->c4;
+      c1 = p->c0 + x2 * p->c1;
+
+      x6 = x4 * x2;
+      c = c1 + x4 * p->c2;
+
+      return c + x6 * c2;
+    }
+}
+
 /* Fast range reduction using single multiply-subtract.  Return the modulo of
    X as a value between -PI/4 and PI/4 and store the quadrant in NP.
    The values for PI/2 and 2/PI are accessed via P.  Since PI/2 as a double
diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
index 138e318dcce81812a64cd812883f709497e66656..5d2333e6bce23324d446ea8ef98ad313fc62cd66 100644
--- a/sysdeps/ieee754/flt-32/s_sinf.c
+++ b/sysdeps/ieee754/flt-32/s_sinf.c
@@ -1,5 +1,5 @@
 /* Compute sine of argument.
-   Copyright (C) 2017-2018 Free Software Foundation, Inc.
+   Copyright (C) 2018 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -16,10 +16,11 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
-#include <errno.h>
+#include <stdint.h>
 #include <math.h>
 #include <math_private.h>
 #include <libm-alias-float.h>
+#include "math_config.h"
 #include "s_sincosf.h"
 
 #ifndef SINF
@@ -28,127 +29,62 @@
 # define SINF_FUNC SINF
 #endif
 
+/* Fast sinf implementation.  Worst-case ULP is 0.56072, maximum relative
+   error is 0.5303p-23.  A single-step signed range reduction is used for
+   small values.  Large inputs have their range reduced using fast integer
+   arithmetic.
+*/
 float
-SINF_FUNC (float x)
+SINF_FUNC (float y)
 {
-  double cx;
-  double theta = x;
-  double abstheta = fabs (theta);
-  /* If |x|< Pi/4.  */
-  if (isless (abstheta, M_PI_4))
+  double x = y;
+  double s;
+  int n;
+  sincos_t *p = &sincosf_table[0];
+
+  if (abstop12 (y) < abstop12 (pio4))
+    {
+      s = x * x;
+
+      if (__glibc_unlikely (abstop12 (y) < abstop12 (0x1p-12f)))
+      {
+	/* Force underflow for tiny y.  */
+	if (__glibc_unlikely (abstop12 (y) < abstop12 (0x1p-126f)))
+	  math_force_eval ((float)s);
+	return y;
+      }
+
+      return sinf_poly (x, s, p, 0);
+    }
+  else if (__glibc_likely (abstop12 (y) < abstop12 (120.0f)))
     {
-      if (abstheta >= 0x1p-5) /* |x| >= 2^-5.  */
-	{
-	  const double theta2 = theta * theta;
-	  /* Chebyshev polynomial of the form for sin
-	     x+x^3*(S0+x^2*(S1+x^2*(S2+x^2*(S3+x^2*S4)))).  */
-	  cx = S3 + theta2 * S4;
-	  cx = S2 + theta2 * cx;
-	  cx = S1 + theta2 * cx;
-	  cx = S0 + theta2 * cx;
-	  cx = theta + theta * theta2 * cx;
-	  return cx;
-	}
-      else if (abstheta >= 0x1p-27)     /* |x| >= 2^-27.  */
-	{
-	  /* A simpler Chebyshev approximation is close enough for this range:
-	     for sin: x+x^3*(SS0+x^2*SS1).  */
-	  const double theta2 = theta * theta;
-	  cx = SS0 + theta2 * SS1;
-	  cx = theta + theta * theta2 * cx;
-	  return cx;
-	}
-      else
-	{
-	  /* Handle some special cases.  */
-	  if (theta)
-	    return theta - (theta * SMALL);
-	  else
-	    return theta;
-	}
+      x = reduce_fast (x, p, &n);
+
+      /* Setup the signs for sin and cos.  */
+      s = p->sign[n & 3];
+
+      if (n & 2)
+	p = &sincosf_table[1];
+
+      return sinf_poly (x * s, x * x, p, n);
     }
-  else                          /* |x| >= Pi/4.  */
+  else if (abstop12 (y) < abstop12 (INFINITY))
     {
-      unsigned int signbit = isless (x, 0);
-      if (isless (abstheta, 9 * M_PI_4))        /* |x| < 9*Pi/4.  */
-	{
-	  /* There are cases where FE_UPWARD rounding mode can
-	     produce a result of abstheta * inv_PI_4 == 9,
-	     where abstheta < 9pi/4, so the domain for
-	     pio2_table must go to 5 (9 / 2 + 1).  */
-	  unsigned int n = (abstheta * inv_PI_4) + 1;
-	  theta = abstheta - pio2_table[n / 2];
-	  return reduced_sin (theta, n, signbit);
-	}
-      else if (isless (abstheta, INFINITY))
-	{
-	  if (abstheta < 0x1p+23)     /* |x| < 2^23.  */
-	    {
-	      unsigned int n = ((unsigned int) (abstheta * inv_PI_4)) + 1;
-	      double x = n / 2;
-	      theta = (abstheta - x * PI_2_hi) - x * PI_2_lo;
-	      /* Argument reduction needed.  */
-	      return reduced_sin (theta, n, signbit);
-	    }
-	  else                  /* |x| >= 2^23.  */
-	    {
-	      x = fabsf (x);
-	      int exponent;
-	      GET_FLOAT_WORD (exponent, x);
-	      exponent
-	        = (exponent >> FLOAT_EXPONENT_SHIFT) - FLOAT_EXPONENT_BIAS;
-	      exponent += 3;
-	      exponent /= 28;
-	      double a = invpio4_table[exponent] * x;
-	      double b = invpio4_table[exponent + 1] * x;
-	      double c = invpio4_table[exponent + 2] * x;
-	      double d = invpio4_table[exponent + 3] * x;
-	      uint64_t l = a;
-	      l &= ~0x7;
-	      a -= l;
-	      double e = a + b;
-	      l = e;
-	      e = a - l;
-	      if (l & 1)
-	        {
-	          e -= 1.0;
-	          e += b;
-	          e += c;
-	          e += d;
-	          e *= M_PI_4;
-	          return reduced_sin (e, l + 1, signbit);
-	        }
-	      else
-		{
-		  e += b;
-		  e += c;
-		  e += d;
-		  if (e <= 1.0)
-		    {
-		      e *= M_PI_4;
-		      return reduced_sin (e, l + 1, signbit);
-		    }
-		  else
-		    {
-		      l++;
-		      e -= 2.0;
-		      e *= M_PI_4;
-		      return reduced_sin (e, l + 1, signbit);
-		    }
-		}
-	    }
-	}
-      else
-	{
-	  int32_t ix;
-	  /* High word of x.  */
-	  GET_FLOAT_WORD (ix, abstheta);
-	  /* Sin(Inf or NaN) is NaN.  */
-	  if (ix == 0x7f800000)
-	    __set_errno (EDOM);
-	  return x - x;
-	}
+      uint32_t xi = asuint (y);
+      int sign = xi >> 31;
+
+      x = reduce_large (xi, &n);
+
+      /* Setup signs for sin and cos - include original sign.  */
+      s = p->sign[(n + sign) & 3];
+
+      if ((n + sign) & 2)
+	p = &sincosf_table[1];
+
+      return sinf_poly (x * s, x * x, p, n);
     }
+  else
+    return __math_invalidf (y);
 }
 
 #ifndef SINF
diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps
index bbb8a4d0754dbe6665682cd8a7f51f7319a14014..c3c122b14cc1f976adb9df0723c84e896edfa426 100644
--- a/sysdeps/x86_64/fpu/libm-test-ulps
+++ b/sysdeps/x86_64/fpu/libm-test-ulps
@@ -1271,24 +1271,30 @@ ldouble: 1
 
 Function: "cos_downward":
 double: 1
+float: 1
 float128: 3
 idouble: 1
+ifloat: 1
 ifloat128: 3
 ildouble: 3
 ldouble: 3
 
 Function: "cos_towardzero":
 double: 1
+float: 1
 float128: 1
 idouble: 1
+ifloat: 1
 ifloat128: 1
 ildouble: 2
 ldouble: 2
 
 Function: "cos_upward":
 double: 1
+float: 1
 float128: 2
 idouble: 1
+ifloat: 1
 ifloat128: 2
 ildouble: 2
 ldouble: 2
@@ -2539,24 +2545,30 @@ ldouble: 1
 
 Function: "sin_downward":
 double: 1
+float: 1
 float128: 3
 idouble: 1
+ifloat:1
 ifloat128: 3
 ildouble: 3
 ldouble: 3
 
 Function: "sin_towardzero":
 double: 1
+float:1
 float128: 2
 idouble: 1
+ifloat: 1
 ifloat128: 2
 ildouble: 2
 ldouble: 2
 
 Function: "sin_upward":
 double: 1
+float:1
 float128: 3
 idouble: 1
+ifloat: 1
 ifloat128: 3
 ildouble: 3
 ldouble: 3
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c b/sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c
index 64abe7abca1b9caabbf19acbcb0e3c5952833cad..0b80c4fe0dddad59c67e0cc24f4d36fec0266de1 100644
--- a/sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c
+++ b/sysdeps/x86_64/fpu/multiarch/s_sincosf-fma.c
@@ -21,7 +21,6 @@
 #include <math_private.h>
 #include <x86intrin.h>
 #include <libm-alias-float.h>
-#include "s_sincosf.h"
 
 #define SINCOSF __sincosf_fma
 
@@ -31,6 +30,38 @@
 # define SINCOSF_FUNC SINCOSF
 #endif
 
+/* PI/2 with 98 bits of accuracy.  */
+static const double PI_2_hi = 0x1.921fb544p+0;
+static const double PI_2_lo = 0x1.0b4611a626332p-34;
+
+static const double SMALL = 0x1p-50; /* 2^-50.  */
+static const double inv_PI_4 = 0x1.45f306dc9c883p+0; /* 4/PI.  */
+
+#define FLOAT_EXPONENT_SHIFT 23
+#define FLOAT_EXPONENT_BIAS 127
+
+static const double pio2_table[] = {
+  0 * M_PI_2,
+  1 * M_PI_2,
+  2 * M_PI_2,
+  3 * M_PI_2,
+  4 * M_PI_2,
+  5 * M_PI_2
+};
+
+static const double invpio4_table[] = {
+  0x0p+0,
+  0x1.45f306cp+0,
+  0x1.c9c882ap-28,
+  0x1.4fe13a8p-58,
+  0x1.f47d4dp-85,
+  0x1.bb81b6cp-112,
+  0x1.4acc9ep-142,
+  0x1.0e4107cp-169
+};
+
+static const double ones[] = { 1.0, -1.0 };
+
 /* Chebyshev constants for sin and cos, range -PI/4 - PI/4.  */
 static const __v2df V0 = { -0x1.5555555551cd9p-3, -0x1.ffffffffe98aep-2};
 static const __v2df V1 = { 0x1.1111110c2688bp-7, 0x1.55555545c50c7p-5 };