From patchwork Tue Dec 28 20:11:13 2021
X-Patchwork-Submitter: Sunil Pandey
X-Patchwork-Id: 1573812
To: libc-alpha@sourceware.org
Subject: [PATCH v4 01/18] x86-64: Add vector atan/atanf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:13 -0800
Message-Id: <20211228201130.737370-2-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
List-Id: Libc-alpha mailing list
From: Sunil Pandey
Reply-To: Sunil K Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Implement vectorized atan/atanf for libmvec, providing SSE4, AVX, AVX2 and
AVX512 versions as specified by the vector ABI, together with accuracy and
ABI tests for vector atan/atanf and regenerated ulps.
---
 bits/libm-simd-decl-stubs.h                   |  11 +
 math/bits/mathcalls.h                         |   2 +-
 .../unix/sysv/linux/x86_64/libmvec.abilist    |   8 +
 sysdeps/x86/fpu/bits/math-vector.h            |   4 +
 .../x86/fpu/finclude/math-vector-fortran.h    |   4 +
 sysdeps/x86_64/fpu/Makeconfig                 |   1 +
 sysdeps/x86_64/fpu/Versions                   |   2 +
 sysdeps/x86_64/fpu/libm-test-ulps             |  20 ++
 .../fpu/multiarch/svml_d_atan2_core-sse2.S    |  20 ++
 .../x86_64/fpu/multiarch/svml_d_atan2_core.c  |  27 ++
 .../fpu/multiarch/svml_d_atan2_core_sse4.S    | 245 ++++++++++++++++++
 .../fpu/multiarch/svml_d_atan4_core-sse.S     |  20 ++
 .../x86_64/fpu/multiarch/svml_d_atan4_core.c  |  27 ++
 .../fpu/multiarch/svml_d_atan4_core_avx2.S    | 225 ++++++++++++++++
 .../fpu/multiarch/svml_d_atan8_core-avx2.S    |  20 ++
 .../x86_64/fpu/multiarch/svml_d_atan8_core.c  |  27 ++
 .../fpu/multiarch/svml_d_atan8_core_avx512.S  | 213 +++++++++++++++
 .../fpu/multiarch/svml_s_atanf16_core-avx2.S  |  20 ++
 .../fpu/multiarch/svml_s_atanf16_core.c       |  28 ++
 .../multiarch/svml_s_atanf16_core_avx512.S    | 174 +++++++++++++
 .../fpu/multiarch/svml_s_atanf4_core-sse2.S   |  20 ++
 .../x86_64/fpu/multiarch/svml_s_atanf4_core.c |  28 ++
 .../fpu/multiarch/svml_s_atanf4_core_sse4.S   | 164 ++++++++++++
 .../fpu/multiarch/svml_s_atanf8_core-sse.S    |  20 ++
 .../x86_64/fpu/multiarch/svml_s_atanf8_core.c |  28 ++
 .../fpu/multiarch/svml_s_atanf8_core_avx2.S   | 148 +++++++++++
 sysdeps/x86_64/fpu/svml_d_atan2_core.S        |  29 +++
 sysdeps/x86_64/fpu/svml_d_atan4_core.S        |  29 +++
 sysdeps/x86_64/fpu/svml_d_atan4_core_avx.S    |  25 ++
 sysdeps/x86_64/fpu/svml_d_atan8_core.S        |  25 ++
 sysdeps/x86_64/fpu/svml_s_atanf16_core.S      |  25 ++
 sysdeps/x86_64/fpu/svml_s_atanf4_core.S       |  29 +++
 sysdeps/x86_64/fpu/svml_s_atanf8_core.S       |  29 +++
 sysdeps/x86_64/fpu/svml_s_atanf8_core_avx.S   |  25 ++
 .../x86_64/fpu/test-double-libmvec-atan-avx.c |   1 +
 .../fpu/test-double-libmvec-atan-avx2.c       |   1 +
 .../fpu/test-double-libmvec-atan-avx512f.c    |   1 +
 sysdeps/x86_64/fpu/test-double-libmvec-atan.c |   3 +
 .../x86_64/fpu/test-double-vlen2-wrappers.c   |   1 +
 .../fpu/test-double-vlen4-avx2-wrappers.c     |   1 +
 .../x86_64/fpu/test-double-vlen4-wrappers.c   |   1 +
 .../x86_64/fpu/test-double-vlen8-wrappers.c   |   1 +
 .../x86_64/fpu/test-float-libmvec-atanf-avx.c |   1 +
 .../fpu/test-float-libmvec-atanf-avx2.c       |   1 +
 .../fpu/test-float-libmvec-atanf-avx512f.c    |   1 +
 sysdeps/x86_64/fpu/test-float-libmvec-atanf.c |   3 +
 .../x86_64/fpu/test-float-vlen16-wrappers.c   |   1 +
 .../x86_64/fpu/test-float-vlen4-wrappers.c    |   1 +
 .../fpu/test-float-vlen8-avx2-wrappers.c      |   1 +
 .../x86_64/fpu/test-float-vlen8-wrappers.c    |   1 +
 50 files changed, 1741 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core-sse2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core_sse4.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core-sse.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core_avx2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core-avx2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core_avx512.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core-avx2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core-sse2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S
 create mode 100644
sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 2ccdd1fc53..b4647ca918 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -109,4 +109,15 @@ #define __DECL_SIMD_acosf32x #define __DECL_SIMD_acosf64x #define __DECL_SIMD_acosf128x + +#define __DECL_SIMD_atan +#define __DECL_SIMD_atanf +#define __DECL_SIMD_atanl +#define __DECL_SIMD_atanf16 +#define __DECL_SIMD_atanf32 +#define __DECL_SIMD_atanf64 +#define __DECL_SIMD_atanf128 +#define __DECL_SIMD_atanf32x +#define __DECL_SIMD_atanf64x +#define __DECL_SIMD_atanf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 2cc6654208..3e27c21f21 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -54,7 +54,7 @@ __MATHCALL_VEC 
(acos,, (_Mdouble_ __x)); /* Arc sine of X. */ __MATHCALL (asin,, (_Mdouble_ __x)); /* Arc tangent of X. */ -__MATHCALL (atan,, (_Mdouble_ __x)); +__MATHCALL_VEC (atan,, (_Mdouble_ __x)); /* Arc tangent of Y/X. */ __MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index b37b55777e..a93258db6f 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -47,10 +47,18 @@ GLIBC_2.22 _ZGVeN8v_sin F GLIBC_2.22 _ZGVeN8vv_pow F GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F +GLIBC_2.35 _ZGVbN2v_atan F GLIBC_2.35 _ZGVbN4v_acosf F +GLIBC_2.35 _ZGVbN4v_atanf F GLIBC_2.35 _ZGVcN4v_acos F +GLIBC_2.35 _ZGVcN4v_atan F GLIBC_2.35 _ZGVcN8v_acosf F +GLIBC_2.35 _ZGVcN8v_atanf F GLIBC_2.35 _ZGVdN4v_acos F +GLIBC_2.35 _ZGVdN4v_atan F GLIBC_2.35 _ZGVdN8v_acosf F +GLIBC_2.35 _ZGVdN8v_atanf F GLIBC_2.35 _ZGVeN16v_acosf F +GLIBC_2.35 _ZGVeN16v_atanf F GLIBC_2.35 _ZGVeN8v_acos F +GLIBC_2.35 _ZGVeN8v_atan F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index dabb74cbb9..1c0e5c5e35 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -62,6 +62,10 @@ # define __DECL_SIMD_acos __DECL_SIMD_x86_64 # undef __DECL_SIMD_acosf # define __DECL_SIMD_acosf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_atan +# define __DECL_SIMD_atan __DECL_SIMD_x86_64 +# undef __DECL_SIMD_atanf +# define __DECL_SIMD_atanf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 4bcbd1fbce..ddcccb11d7 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -30,6 +30,8 @@ !GCC$ builtin (powf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (acos) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (acosf) 
attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (atan) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (atanf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -45,3 +47,5 @@ !GCC$ builtin (powf) attributes simd (notinbranch) if('x32') !GCC$ builtin (acos) attributes simd (notinbranch) if('x32') !GCC$ builtin (acosf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (atan) attributes simd (notinbranch) if('x32') +!GCC$ builtin (atanf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 7acf1f306c..dae0887f13 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -23,6 +23,7 @@ postclean-generated += libmvec.mk # Define for both math and mathvec directories. libmvec-funcs = \ acos \ + atan \ cos \ exp \ log \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 2985fe7ca7..424f6d526e 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -15,6 +15,8 @@ libmvec { } GLIBC_2.35 { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; + _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; + _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; } } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 6c12976c82..2e64e59803 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -164,6 +164,26 @@ float: 2 float128: 2 ldouble: 1 +Function: "atan_vlen16": +float: 1 + +Function: "atan_vlen2": +double: 1 + +Function: "atan_vlen4": +double: 1 +float: 1 + +Function: "atan_vlen4_avx2": +double: 1 + +Function: "atan_vlen8": +double: 1 +float: 1 + +Function: "atan_vlen8_avx2": +float: 1 + Function: "atanh": double: 2 float: 2 diff --git 
a/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core-sse2.S new file mode 100644 index 0000000000..115e5223aa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized atan, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_atan _ZGVbN2v_atan_sse2 +#include "../svml_d_atan2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core.c new file mode 100644 index 0000000000..93f079ffcb --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized atan, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN2v_atan +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_atan, __GI__ZGVbN2v_atan, __redirect__ZGVbN2v_atan) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core_sse4.S new file mode 100644 index 0000000000..f0ad036b9e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan2_core_sse4.S @@ -0,0 +1,245 @@ +/* Function atan vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. + * + */ + +/* Offsets for data table __svml_datan_data_internal_avx512 + */ +#define AbsMask 0 +#define Shifter 16 +#define MaxThreshold 32 +#define MOne 48 +#define One 64 +#define LargeX 80 +#define Zero 96 +#define Tbl_H 112 +#define Tbl_L 368 +#define dIndexMed 624 +#define Pi2 640 +#define Pi2_low 656 +#define coeff 672 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_atan_sse4) + lea Tbl_H+128+__svml_datan_data_internal_avx512(%rip), %rcx + movups __svml_datan_data_internal_avx512(%rip), %xmm4 + movups Shifter+__svml_datan_data_internal_avx512(%rip), %xmm3 + andps %xmm0, %xmm4 + movaps %xmm3, %xmm12 + movaps %xmm4, %xmm5 + addpd %xmm4, %xmm12 + movaps %xmm12, %xmm7 + +/* + * table lookup sequence + * VPERMUTE not available + */ + movaps %xmm12, %xmm10 + subpd %xmm3, %xmm7 + subpd %xmm7, %xmm5 + mulpd %xmm4, %xmm7 + movups MaxThreshold+__svml_datan_data_internal_avx512(%rip), %xmm2 + psllq $3, %xmm10 + +/* saturate X range */ + movups LargeX+__svml_datan_data_internal_avx512(%rip), %xmm8 + pxor %xmm4, %xmm0 + cmplepd %xmm4, %xmm2 + addpd One+__svml_datan_data_internal_avx512(%rip), %xmm7 + minpd %xmm4, %xmm8 + movups MOne+__svml_datan_data_internal_avx512(%rip), %xmm6 + movaps %xmm2, %xmm1 + movaps %xmm2, %xmm9 + andnps %xmm5, %xmm1 + andps %xmm2, %xmm6 + andnps %xmm7, %xmm9 + andps %xmm2, %xmm8 + orps %xmm6, %xmm1 + orps %xmm8, %xmm9 + +/* R+Rl = DiffX/Y */ + divpd %xmm9, %xmm1 + pand 
.FLT_11(%rip), %xmm10 + +/* set table value to Pi/2 for large X */ + movups Pi2+__svml_datan_data_internal_avx512(%rip), %xmm4 + movd %xmm10, %eax + andps %xmm2, %xmm4 + pshufd $2, %xmm10, %xmm11 + movaps %xmm2, %xmm10 + +/* polynomial evaluation */ + movaps %xmm1, %xmm2 + mulpd %xmm1, %xmm2 + movd %xmm11, %edx + movups coeff+__svml_datan_data_internal_avx512(%rip), %xmm5 + movaps %xmm2, %xmm7 + movups coeff+32+__svml_datan_data_internal_avx512(%rip), %xmm6 + movaps %xmm2, %xmm9 + mulpd %xmm2, %xmm5 + mulpd %xmm2, %xmm7 + addpd coeff+16+__svml_datan_data_internal_avx512(%rip), %xmm5 + mulpd %xmm2, %xmm6 + mulpd %xmm7, %xmm5 + addpd coeff+48+__svml_datan_data_internal_avx512(%rip), %xmm6 + mulpd %xmm1, %xmm9 + addpd %xmm5, %xmm6 + movups coeff+64+__svml_datan_data_internal_avx512(%rip), %xmm8 + mulpd %xmm2, %xmm8 + mulpd %xmm6, %xmm7 + addpd coeff+80+__svml_datan_data_internal_avx512(%rip), %xmm8 + addpd %xmm7, %xmm8 + mulpd %xmm8, %xmm9 + movups dIndexMed+__svml_datan_data_internal_avx512(%rip), %xmm14 + cmplepd %xmm12, %xmm14 + addpd %xmm9, %xmm1 + movslq %eax, %rax + movaps %xmm14, %xmm3 + movslq %edx, %rdx + movsd -128(%rax,%rcx), %xmm13 + movsd (%rcx,%rax), %xmm15 + movhpd -128(%rdx,%rcx), %xmm13 + movhpd (%rcx,%rdx), %xmm15 + andnps %xmm13, %xmm3 + andps %xmm14, %xmm15 + orps %xmm15, %xmm3 + andnps %xmm3, %xmm10 + orps %xmm4, %xmm10 + addpd %xmm1, %xmm10 + pxor %xmm10, %xmm0 + ret + +END(_ZGVbN2v_atan_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_datan_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 AbsMask[2][2]; + __declspec(align(16)) VUINT32 Shifter[2][2]; + __declspec(align(16)) VUINT32 MaxThreshold[2][2]; + __declspec(align(16)) VUINT32 MOne[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 LargeX[2][2]; + __declspec(align(16)) VUINT32 Zero[2][2]; + __declspec(align(16)) VUINT32 Tbl_H[32][2]; + __declspec(align(16)) VUINT32 Tbl_L[32][2]; + 
__declspec(align(16)) VUINT32 dIndexMed[2][2]; + __declspec(align(16)) VUINT32 Pi2[2][2]; + __declspec(align(16)) VUINT32 Pi2_low[2][2]; + __declspec(align(16)) VUINT32 coeff[6][2][2]; + } __svml_datan_data_internal_avx512; +#endif +__svml_datan_data_internal_avx512: + /*== AbsMask ==*/ + .quad 0x7fffffffffffffff, 0x7fffffffffffffff + /*== Shifter ==*/ + .align 16 + .quad 0x4318000000000000, 0x4318000000000000 + /*== MaxThreshold ==*/ + .align 16 + .quad 0x401f800000000000, 0x401f800000000000 + /*== MOne ==*/ + .align 16 + .quad 0xbff0000000000000, 0xbff0000000000000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== LargeX ==*/ + .align 16 + .quad 0x47f0000000000000, 0x47f0000000000000 + /*== Zero ==*/ + .align 16 + .quad 0x0000000000000000, 0x0000000000000000 + /*== Tbl_H ==*/ + .align 16 + .quad 0x0000000000000000, 0x3fcf5b75f92c80dd + .quad 0x3fddac670561bb4f, 0x3fe4978fa3269ee1 + .quad 0x3fe921fb54442d18, 0x3fecac7c57846f9e + .quad 0x3fef730bd281f69b, 0x3ff0d38f2c5ba09f + .quad 0x3ff1b6e192ebbe44, 0x3ff270ef55a53a25 + .quad 0x3ff30b6d796a4da8, 0x3ff38d6a6ce13353 + .quad 0x3ff3fc176b7a8560, 0x3ff45b54837351a0 + .quad 0x3ff4ae10fc6589a5, 0x3ff4f68dea672617 + .quad 0x3ff5368c951e9cfd, 0x3ff56f6f33a3e6a7 + .quad 0x3ff5a25052114e60, 0x3ff5d013c41adabd + .quad 0x3ff5f97315254857, 0x3ff61f06c6a92b89 + .quad 0x3ff6414d44094c7c, 0x3ff660b02c736a06 + .quad 0x3ff67d8863bc99bd, 0x3ff698213a9d5053 + .quad 0x3ff6b0bae830c070, 0x3ff6c78c7edeb195 + .quad 0x3ff6dcc57bb565fd, 0x3ff6f08f07435fec + .quad 0x3ff7030cf9403197, 0x3ff7145eac2088a4 + /*== Tbl_L ==*/ + .align 16 + .quad 0x0000000000000000, 0x3c68ab6e3cf7afbd + .quad 0x3c7a2b7f222f65e2, 0x3c72419a87f2a458 + .quad 0x3c81a62633145c07, 0x3c80dae13ad18a6b + .quad 0x3c7007887af0cbbd, 0xbc9bd0dc231bfd70 + .quad 0x3c9b1b466a88828e, 0xbc9a66b1af5f84fb + .quad 0x3c96254cb03bb199, 0xbc812c77e8a80f5c + .quad 0xbc4441a3bd3f1084, 0x3c79e4a72eedacc4 + .quad 0xbc93b03e8a27f555, 0x3c9934f9f2b0020e + .quad 
0xbc996f47948a99f1, 0xbc7df6edd6f1ec3b + .quad 0x3c78c2d0c89de218, 0x3c9f82bba194dd5d + .quad 0xbc831151a43b51ca, 0xbc8487d50bceb1a5 + .quad 0xbc9c5f60a65c7397, 0xbc7acb6afb332a0f + .quad 0xbc99b7bd2e1e8c9c, 0xbc9b9839085189e3 + .quad 0xbc97d1ab82ffb70b, 0x3c99239ad620ffe2 + .quad 0xbc929c86447928e7, 0xbc8957a7170df016 + .quad 0xbc7cbe1896221608, 0xbc9fda5797b32a0b + /*== dIndexMed ==*/ + .align 16 + .quad 0x4318000000000010, 0x4318000000000010 + /*== Pi2 ==*/ + .align 16 + .quad 0x3ff921fb54442d18, 0x3ff921fb54442d18 + /*== Pi2_low ==*/ + .align 16 + .quad 0x3c91a62633145c07, 0x3c91a62633145c07 + /*== coeff6 ==*/ + .align 16 + .quad 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97 + .quad 0xbfb74257c46790cc, 0xbfb74257c46790cc + .quad 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0 + .quad 0xbfc249248eef04da, 0xbfc249248eef04da + .quad 0x3fc999999998741e, 0x3fc999999998741e + .quad 0xbfd555555555554d, 0xbfd555555555554d + .align 16 + .type __svml_datan_data_internal_avx512,@object + .size __svml_datan_data_internal_avx512,.-__svml_datan_data_internal_avx512 + .align 16 + +.FLT_11: + .long 0x00000078,0x00000000,0x00000078,0x00000000 + .type .FLT_11,@object + .size .FLT_11,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core-sse.S new file mode 100644 index 0000000000..79c48dbc91 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized atan, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN4v_atan _ZGVdN4v_atan_sse_wrapper +#include "../svml_d_atan4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core.c new file mode 100644 index 0000000000..64ce66b9fd --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized atan, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN4v_atan +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_atan, __GI__ZGVdN4v_atan, __redirect__ZGVdN4v_atan) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core_avx2.S new file mode 100644 index 0000000000..50336514d7 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan4_core_avx2.S @@ -0,0 +1,225 @@ +/* Function atan vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. 
+ * + */ + +/* Offsets for data table __svml_datan_data_internal_avx512 + */ +#define AbsMask 0 +#define Shifter 32 +#define MaxThreshold 64 +#define MOne 96 +#define One 128 +#define LargeX 160 +#define Zero 192 +#define Tbl_H 224 +#define Tbl_L 480 +#define dIndexMed 736 +#define Pi2 768 +#define Pi2_low 800 +#define coeff 832 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_atan_avx2) + lea Tbl_H+128+__svml_datan_data_internal_avx512(%rip), %rdi + vmovupd Shifter+__svml_datan_data_internal_avx512(%rip), %ymm4 + vmovupd One+__svml_datan_data_internal_avx512(%rip), %ymm9 + +/* saturate X range */ + vmovupd LargeX+__svml_datan_data_internal_avx512(%rip), %ymm6 + vandpd __svml_datan_data_internal_avx512(%rip), %ymm0, %ymm7 + vaddpd %ymm4, %ymm7, %ymm2 + vcmpge_oqpd MaxThreshold+__svml_datan_data_internal_avx512(%rip), %ymm7, %ymm3 + vminpd %ymm7, %ymm6, %ymm10 + vsubpd %ymm4, %ymm2, %ymm5 + +/* + * table lookup sequence + * VPERMUTE not available + */ + vpsllq $3, %ymm2, %ymm13 + vsubpd %ymm5, %ymm7, %ymm8 + vcmpge_oqpd dIndexMed+__svml_datan_data_internal_avx512(%rip), %ymm2, %ymm2 + vfmadd231pd %ymm7, %ymm5, %ymm9 + vpand .FLT_11(%rip), %ymm13, %ymm14 + vblendvpd %ymm3, MOne+__svml_datan_data_internal_avx512(%rip), %ymm8, %ymm11 + vblendvpd %ymm3, %ymm10, %ymm9, %ymm12 + vxorpd %ymm0, %ymm7, %ymm1 + +/* R+Rl = DiffX/Y */ + vdivpd %ymm12, %ymm11, %ymm0 + vextractf128 $1, %ymm14, %xmm4 + vmovd %xmm14, %eax + vmovd %xmm4, %ecx + movslq %eax, %rax + vpextrd $2, %xmm14, %edx + movslq %ecx, %rcx + vpextrd $2, %xmm4, %esi + movslq %edx, %rdx + movslq %esi, %rsi + vmovsd -128(%rax,%rdi), %xmm15 + vmovsd (%rdi,%rax), %xmm7 + vmovsd -128(%rcx,%rdi), %xmm5 + vmovsd (%rdi,%rcx), %xmm9 + vmovhpd -128(%rdx,%rdi), %xmm15, %xmm15 + vmovhpd (%rdi,%rdx), %xmm7, %xmm8 + vmovhpd -128(%rsi,%rdi), %xmm5, %xmm6 + vmovhpd (%rdi,%rsi), %xmm9, %xmm10 + +/* polynomial evaluation */ + vmulpd %ymm0, %ymm0, %ymm5 + vmulpd %ymm5, %ymm5, %ymm4 + vinsertf128 $1, %xmm6, 
%ymm15, %ymm11 + vinsertf128 $1, %xmm10, %ymm8, %ymm12 + vblendvpd %ymm2, %ymm12, %ymm11, %ymm13 + vmovupd coeff+__svml_datan_data_internal_avx512(%rip), %ymm8 + vmovupd coeff+64+__svml_datan_data_internal_avx512(%rip), %ymm2 + vmulpd %ymm5, %ymm0, %ymm6 + vfmadd213pd coeff+32+__svml_datan_data_internal_avx512(%rip), %ymm5, %ymm8 + vfmadd213pd coeff+96+__svml_datan_data_internal_avx512(%rip), %ymm5, %ymm2 + +/* set table value to Pi/2 for large X */ + vblendvpd %ymm3, Pi2+__svml_datan_data_internal_avx512(%rip), %ymm13, %ymm7 + vmovupd coeff+128+__svml_datan_data_internal_avx512(%rip), %ymm3 + vfmadd213pd %ymm2, %ymm4, %ymm8 + vfmadd213pd coeff+160+__svml_datan_data_internal_avx512(%rip), %ymm3, %ymm5 + vfmadd213pd %ymm5, %ymm4, %ymm8 + vfmadd213pd %ymm0, %ymm6, %ymm8 + vaddpd %ymm8, %ymm7, %ymm0 + vxorpd %ymm1, %ymm0, %ymm0 + ret + +END(_ZGVdN4v_atan_avx2) + + .section .rodata, "a" + .align 32 + +.FLT_11: + .long 0x00000078,0x00000000,0x00000078,0x00000000,0x00000078,0x00000000,0x00000078,0x00000000 + .type .FLT_11,@object + .size .FLT_11,32 + .align 32 + +#ifdef __svml_datan_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 AbsMask[4][2]; + __declspec(align(32)) VUINT32 Shifter[4][2]; + __declspec(align(32)) VUINT32 MaxThreshold[4][2]; + __declspec(align(32)) VUINT32 MOne[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 LargeX[4][2]; + __declspec(align(32)) VUINT32 Zero[4][2]; + __declspec(align(32)) VUINT32 Tbl_H[32][2]; + __declspec(align(32)) VUINT32 Tbl_L[32][2]; + __declspec(align(32)) VUINT32 dIndexMed[4][2]; + __declspec(align(32)) VUINT32 Pi2[4][2]; + __declspec(align(32)) VUINT32 Pi2_low[4][2]; + __declspec(align(32)) VUINT32 coeff[6][4][2]; + } __svml_datan_data_internal_avx512; +#endif +__svml_datan_data_internal_avx512: + /*== AbsMask ==*/ + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== Shifter ==*/ + .align 32 + 
+        .quad 0x4318000000000000, 0x4318000000000000, 0x4318000000000000, 0x4318000000000000
+        /*== MaxThreshold ==*/
+        .align 32
+        .quad 0x401f800000000000, 0x401f800000000000, 0x401f800000000000, 0x401f800000000000
+        /*== MOne ==*/
+        .align 32
+        .quad 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000
+        /*== One ==*/
+        .align 32
+        .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000
+        /*== LargeX ==*/
+        .align 32
+        .quad 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000
+        /*== Zero ==*/
+        .align 32
+        .quad 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000
+        /*== Tbl_H ==*/
+        .align 32
+        .quad 0x0000000000000000, 0x3fcf5b75f92c80dd
+        .quad 0x3fddac670561bb4f, 0x3fe4978fa3269ee1
+        .quad 0x3fe921fb54442d18, 0x3fecac7c57846f9e
+        .quad 0x3fef730bd281f69b, 0x3ff0d38f2c5ba09f
+        .quad 0x3ff1b6e192ebbe44, 0x3ff270ef55a53a25
+        .quad 0x3ff30b6d796a4da8, 0x3ff38d6a6ce13353
+        .quad 0x3ff3fc176b7a8560, 0x3ff45b54837351a0
+        .quad 0x3ff4ae10fc6589a5, 0x3ff4f68dea672617
+        .quad 0x3ff5368c951e9cfd, 0x3ff56f6f33a3e6a7
+        .quad 0x3ff5a25052114e60, 0x3ff5d013c41adabd
+        .quad 0x3ff5f97315254857, 0x3ff61f06c6a92b89
+        .quad 0x3ff6414d44094c7c, 0x3ff660b02c736a06
+        .quad 0x3ff67d8863bc99bd, 0x3ff698213a9d5053
+        .quad 0x3ff6b0bae830c070, 0x3ff6c78c7edeb195
+        .quad 0x3ff6dcc57bb565fd, 0x3ff6f08f07435fec
+        .quad 0x3ff7030cf9403197, 0x3ff7145eac2088a4
+        /*== Tbl_L ==*/
+        .align 32
+        .quad 0x0000000000000000, 0x3c68ab6e3cf7afbd
+        .quad 0x3c7a2b7f222f65e2, 0x3c72419a87f2a458
+        .quad 0x3c81a62633145c07, 0x3c80dae13ad18a6b
+        .quad 0x3c7007887af0cbbd, 0xbc9bd0dc231bfd70
+        .quad 0x3c9b1b466a88828e, 0xbc9a66b1af5f84fb
+        .quad 0x3c96254cb03bb199, 0xbc812c77e8a80f5c
+        .quad 0xbc4441a3bd3f1084, 0x3c79e4a72eedacc4
+        .quad 0xbc93b03e8a27f555, 0x3c9934f9f2b0020e
+        .quad 0xbc996f47948a99f1, 0xbc7df6edd6f1ec3b
+        .quad 0x3c78c2d0c89de218, 0x3c9f82bba194dd5d
+        .quad 0xbc831151a43b51ca, 0xbc8487d50bceb1a5
+        .quad 0xbc9c5f60a65c7397, 0xbc7acb6afb332a0f
+        .quad 0xbc99b7bd2e1e8c9c, 0xbc9b9839085189e3
+        .quad 0xbc97d1ab82ffb70b, 0x3c99239ad620ffe2
+        .quad 0xbc929c86447928e7, 0xbc8957a7170df016
+        .quad 0xbc7cbe1896221608, 0xbc9fda5797b32a0b
+        /*== dIndexMed ==*/
+        .align 32
+        .quad 0x4318000000000010, 0x4318000000000010, 0x4318000000000010, 0x4318000000000010
+        /*== Pi2 ==*/
+        .align 32
+        .quad 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18
+        /*== Pi2_low ==*/
+        .align 32
+        .quad 0x3c91a62633145c07, 0x3c91a62633145c07, 0x3c91a62633145c07, 0x3c91a62633145c07
+        /*== coeff6 ==*/
+        .align 32
+        .quad 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97
+        .quad 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc
+        .quad 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0
+        .quad 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da
+        .quad 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e
+        .quad 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d
+        .align 32
+        .type __svml_datan_data_internal_avx512,@object
+        .size __svml_datan_data_internal_avx512,.-__svml_datan_data_internal_avx512
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core-avx2.S
new file mode 100644
index 0000000000..723734e10b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core-avx2.S
@@ -0,0 +1,20 @@
+/* AVX2 version of vectorized atan, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define _ZGVeN8v_atan _ZGVeN8v_atan_avx2_wrapper
+#include "../svml_d_atan8_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core.c
new file mode 100644
index 0000000000..e97a41b6bc
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized atan, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVeN8v_atan
+#include "ifunc-mathvec-avx512-skx.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVeN8v_atan, __GI__ZGVeN8v_atan, __redirect__ZGVeN8v_atan)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core_avx512.S
new file mode 100644
index 0000000000..fa6cb47308
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan8_core_avx512.S
@@ -0,0 +1,213 @@
+/* Function atan vectorized with AVX-512.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x)
+ * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x)
+ * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x)
+ * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x)
+ * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x
+ * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/16.0.
+ *
+ */
+
+/* Offsets for data table __svml_datan_data_internal_avx512
+ */
+#define AbsMask 0
+#define Shifter 64
+#define MaxThreshold 128
+#define MOne 192
+#define One 256
+#define LargeX 320
+#define Zero 384
+#define Tbl_H 448
+#define dIndexMed 704
+#define Pi2 768
+#define coeff_1 832
+#define coeff_2 896
+#define coeff_3 960
+#define coeff_4 1024
+#define coeff_5 1088
+#define coeff_6 1152
+
+#include <sysdep.h>
+
+        .text
+        .section .text.evex512,"ax",@progbits
+ENTRY(_ZGVeN8v_atan_skx)
+        vmovups Shifter+__svml_datan_data_internal_avx512(%rip), %zmm4
+        vmovups MaxThreshold+__svml_datan_data_internal_avx512(%rip), %zmm3
+        vmovups One+__svml_datan_data_internal_avx512(%rip), %zmm9
+
+/* saturate X range */
+        vmovups LargeX+__svml_datan_data_internal_avx512(%rip), %zmm7
+        vandpd __svml_datan_data_internal_avx512(%rip), %zmm0, %zmm8
+
+/* R+Rl = DiffX/Y */
+        vbroadcastsd .FLT_10(%rip), %zmm15
+        vaddpd {rn-sae}, %zmm4, %zmm8, %zmm2
+        vxorpd %zmm0, %zmm8, %zmm1
+        vcmppd $29, {sae}, %zmm3, %zmm8, %k2
+
+/* round to 2 bits after binary point */
+        vreducepd $40, {sae}, %zmm8, %zmm6
+        vsubpd {rn-sae}, %zmm4, %zmm2, %zmm5
+
+/*
+ * if|X|>=MaxThreshold, set DiffX=-1
+ * VMSUB(D, DiffX, LargeMask, Zero, One);
+ */
+        vblendmpd MOne+__svml_datan_data_internal_avx512(%rip), %zmm6, %zmm10{%k2}
+        vfmadd231pd {rn-sae}, %zmm8, %zmm5, %zmm9
+        vmovups dIndexMed+__svml_datan_data_internal_avx512(%rip), %zmm5
+
+/* table lookup sequence */
+        vmovups Tbl_H+__svml_datan_data_internal_avx512(%rip), %zmm6
+        vgetmantpd $0, {sae}, %zmm10, %zmm14
+        vgetexppd {sae}, %zmm10, %zmm11
+        vmovups coeff_5+__svml_datan_data_internal_avx512(%rip), %zmm10
+
+/*
+ * if|X|>=MaxThreshold, set Y=X
+ * VMADD(D, Y, LargeMask, X, Zero);
+ */
+        vminpd {sae}, %zmm8, %zmm7, %zmm9{%k2}
+        vcmppd $29, {sae}, %zmm5, %zmm2, %k1
+        vmovups Tbl_H+128+__svml_datan_data_internal_avx512(%rip), %zmm7
+        vmovups coeff_1+__svml_datan_data_internal_avx512(%rip), %zmm8
+        vgetmantpd $0, {sae}, %zmm9, %zmm3
+        vgetexppd {sae}, %zmm9, %zmm12
+        vmovups coeff_3+__svml_datan_data_internal_avx512(%rip), %zmm9
+        vpermt2pd Tbl_H+64+__svml_datan_data_internal_avx512(%rip), %zmm2, %zmm6
+        vsubpd {rn-sae}, %zmm12, %zmm11, %zmm4
+        vpermt2pd Tbl_H+192+__svml_datan_data_internal_avx512(%rip), %zmm2, %zmm7
+        vrcp14pd %zmm3, %zmm13
+        vmovups coeff_4+__svml_datan_data_internal_avx512(%rip), %zmm12
+        vmovups coeff_6+__svml_datan_data_internal_avx512(%rip), %zmm11
+        vblendmpd %zmm7, %zmm6, %zmm2{%k1}
+        vmulpd {rn-sae}, %zmm13, %zmm14, %zmm0
+        vfnmadd231pd {rn-sae}, %zmm3, %zmm13, %zmm15
+        vfnmadd213pd {rn-sae}, %zmm14, %zmm0, %zmm3
+        vfmadd213pd {rn-sae}, %zmm15, %zmm15, %zmm15
+        vfmadd213pd {rn-sae}, %zmm13, %zmm13, %zmm15
+        vfmadd213pd {rn-sae}, %zmm0, %zmm15, %zmm3
+        vscalefpd {rn-sae}, %zmm4, %zmm3, %zmm0
+
+/* set table value to Pi/2 for large X */
+        vblendmpd Pi2+__svml_datan_data_internal_avx512(%rip), %zmm2, %zmm3{%k2}
+        vmovups coeff_2+__svml_datan_data_internal_avx512(%rip), %zmm2
+
+/* polynomial evaluation */
+        vmulpd {rn-sae}, %zmm0, %zmm0, %zmm14
+        vmulpd {rn-sae}, %zmm14, %zmm14, %zmm13
+        vmulpd {rn-sae}, %zmm0, %zmm14, %zmm15
+        vfmadd231pd {rn-sae}, %zmm14, %zmm8, %zmm2
+        vfmadd231pd {rn-sae}, %zmm14, %zmm9, %zmm12
+        vfmadd213pd {rn-sae}, %zmm11, %zmm10, %zmm14
+        vfmadd213pd {rn-sae}, %zmm12, %zmm13, %zmm2
+        vfmadd213pd {rn-sae}, %zmm14, %zmm13, %zmm2
+        vfmadd213pd {rn-sae}, %zmm0, %zmm15, %zmm2
+        vaddpd {rn-sae}, %zmm3, %zmm2, %zmm0
+        vxorpd %zmm1, %zmm0, %zmm0
+        ret
+
+END(_ZGVeN8v_atan_skx)
+
+        .section .rodata, "a"
+        .align 64
+
+#ifdef __svml_datan_data_internal_avx512_typedef
+typedef unsigned int VUINT32;
+typedef struct {
+        __declspec(align(64)) VUINT32 AbsMask[8][2];
+        __declspec(align(64)) VUINT32 Shifter[8][2];
+        __declspec(align(64)) VUINT32 MaxThreshold[8][2];
+        __declspec(align(64)) VUINT32 MOne[8][2];
+        __declspec(align(64)) VUINT32 One[8][2];
+        __declspec(align(64)) VUINT32 LargeX[8][2];
+        __declspec(align(64)) VUINT32 Zero[8][2];
+        __declspec(align(64)) VUINT32 Tbl_H[32][2];
+        __declspec(align(64)) VUINT32 dIndexMed[8][2];
+        __declspec(align(64)) VUINT32 Pi2[8][2];
+        __declspec(align(64)) VUINT32 coeff[6][8][2];
+} __svml_datan_data_internal_avx512;
+#endif
+__svml_datan_data_internal_avx512:
+        /*== AbsMask ==*/
+        .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff
+        /*== Shifter ==*/
+        .align 64
+        .quad 0x4318000000000000, 0x4318000000000000, 0x4318000000000000, 0x4318000000000000, 0x4318000000000000, 0x4318000000000000, 0x4318000000000000, 0x4318000000000000
+        /*== MaxThreshold ==*/
+        .align 64
+        .quad 0x401f800000000000, 0x401f800000000000, 0x401f800000000000, 0x401f800000000000, 0x401f800000000000, 0x401f800000000000, 0x401f800000000000, 0x401f800000000000
+        /*== MOne ==*/
+        .align 64
+        .quad 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000
+        /*== One ==*/
+        .align 64
+        .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000
+        /*== LargeX ==*/
+        .align 64
+        .quad 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000, 0x47f0000000000000
+        /*== Zero ==*/
+        .align 64
+        .quad 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000
+        /*== Tbl_H ==*/
+        .align 64
+        .quad 0x0000000000000000, 0x3fcf5b75f92c80dd
+        .quad 0x3fddac670561bb4f, 0x3fe4978fa3269ee1
+        .quad 0x3fe921fb54442d18, 0x3fecac7c57846f9e
+        .quad 0x3fef730bd281f69b, 0x3ff0d38f2c5ba09f
+        .quad 0x3ff1b6e192ebbe44, 0x3ff270ef55a53a25
+        .quad 0x3ff30b6d796a4da8, 0x3ff38d6a6ce13353
+        .quad 0x3ff3fc176b7a8560, 0x3ff45b54837351a0
+        .quad 0x3ff4ae10fc6589a5, 0x3ff4f68dea672617
+        .quad 0x3ff5368c951e9cfd, 0x3ff56f6f33a3e6a7
+        .quad 0x3ff5a25052114e60, 0x3ff5d013c41adabd
+        .quad 0x3ff5f97315254857, 0x3ff61f06c6a92b89
+        .quad 0x3ff6414d44094c7c, 0x3ff660b02c736a06
+        .quad 0x3ff67d8863bc99bd, 0x3ff698213a9d5053
+        .quad 0x3ff6b0bae830c070, 0x3ff6c78c7edeb195
+        .quad 0x3ff6dcc57bb565fd, 0x3ff6f08f07435fec
+        .quad 0x3ff7030cf9403197, 0x3ff7145eac2088a4
+        /*== dIndexMed ==*/
+        .align 64
+        .quad 0x4318000000000010, 0x4318000000000010, 0x4318000000000010, 0x4318000000000010, 0x4318000000000010, 0x4318000000000010, 0x4318000000000010, 0x4318000000000010
+        /*== Pi2 ==*/
+        .align 64
+        .quad 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18
+        /*== coeff6 ==*/
+        .align 64
+        .quad 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97, 0x3fb2e9b9f5c4fe97
+        .quad 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc, 0xbfb74257c46790cc
+        .quad 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0, 0x3fbc71bfeff916a0
+        .quad 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da, 0xbfc249248eef04da
+        .quad 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e, 0x3fc999999998741e
+        .quad 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d, 0xbfd555555555554d
+        .align 64
+        .type __svml_datan_data_internal_avx512,@object
+        .size __svml_datan_data_internal_avx512,.-__svml_datan_data_internal_avx512
+        .align 8
+
+.FLT_10:
+        .long 0x00000000,0x3ff00000
+        .type .FLT_10,@object
+        .size .FLT_10,8
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core-avx2.S
new file mode 100644
index 0000000000..27623cdf16
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core-avx2.S
@@ -0,0 +1,20 @@
+/* AVX2 version of vectorized atanf.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define _ZGVeN16v_atanf _ZGVeN16v_atanf_avx2_wrapper
+#include "../svml_s_atanf16_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core.c
new file mode 100644
index 0000000000..940de26615
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core.c
@@ -0,0 +1,28 @@
+/* Multiple versions of vectorized atanf, vector length is 16.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVeN16v_atanf
+#include "ifunc-mathvec-avx512-skx.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVeN16v_atanf, __GI__ZGVeN16v_atanf,
+               __redirect__ZGVeN16v_atanf)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S
new file mode 100644
index 0000000000..4a37f03e69
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf16_core_avx512.S
@@ -0,0 +1,174 @@
+/* Function atanf vectorized with AVX-512.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.
 */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x)
+ * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x)
+ * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x)
+ * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x)
+ * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x
+ * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/16.0.
+ *
+ */
+
+/* Offsets for data table __svml_satan_data_internal_avx512
+ */
+#define AbsMask 0
+#define Shifter 64
+#define MaxThreshold 128
+#define MOne 192
+#define One 256
+#define LargeX 320
+#define Zero 384
+#define Tbl_H 448
+#define Pi2 576
+#define coeff_1 640
+#define coeff_2 704
+#define coeff_3 768
+
+#include <sysdep.h>
+
+        .text
+        .section .text.evex512,"ax",@progbits
+ENTRY(_ZGVeN16v_atanf_skx)
+        vandps __svml_satan_data_internal_avx512(%rip), %zmm0, %zmm7
+        vmovups MaxThreshold+__svml_satan_data_internal_avx512(%rip), %zmm3
+        vmovups One+__svml_satan_data_internal_avx512(%rip), %zmm8
+
+/* round to 2 bits after binary point */
+        vreduceps $40, {sae}, %zmm7, %zmm5
+
+/* saturate X range */
+        vmovups LargeX+__svml_satan_data_internal_avx512(%rip), %zmm6
+        vmovups Shifter+__svml_satan_data_internal_avx512(%rip), %zmm2
+        vcmpps $29, {sae}, %zmm3, %zmm7, %k1
+
+/* table lookup sequence */
+        vmovups Tbl_H+__svml_satan_data_internal_avx512(%rip), %zmm3
+        vsubps {rn-sae}, %zmm5, %zmm7, %zmm4
+        vaddps {rn-sae}, %zmm2, %zmm7, %zmm1
+        vxorps %zmm0, %zmm7, %zmm0
+        vfmadd231ps {rn-sae}, %zmm7, %zmm4, %zmm8
+        vmovups coeff_2+__svml_satan_data_internal_avx512(%rip), %zmm4
+
+/* if|X|>=MaxThreshold, set DiffX=-1 */
+        vblendmps MOne+__svml_satan_data_internal_avx512(%rip), %zmm5, %zmm9{%k1}
+        vmovups coeff_3+__svml_satan_data_internal_avx512(%rip), %zmm5
+
+/* if|X|>=MaxThreshold, set Y=X */
+        vminps {sae}, %zmm7, %zmm6, %zmm8{%k1}
+
+/* R+Rl = DiffX/Y */
+        vgetmantps $0, {sae}, %zmm9, %zmm12
+        vgetexpps {sae}, %zmm9, %zmm10
+        vpermt2ps Tbl_H+64+__svml_satan_data_internal_avx512(%rip), %zmm1, %zmm3
+        vgetmantps $0, {sae}, %zmm8, %zmm15
+        vgetexpps {sae}, %zmm8, %zmm11
+        vmovups coeff_1+__svml_satan_data_internal_avx512(%rip), %zmm1
+
+/* set table value to Pi/2 for large X */
+        vblendmps Pi2+__svml_satan_data_internal_avx512(%rip), %zmm3, %zmm9{%k1}
+        vrcp14ps %zmm15, %zmm13
+        vsubps {rn-sae}, %zmm11, %zmm10, %zmm2
+        vmulps {rn-sae}, %zmm13, %zmm12, %zmm14
+        vfnmadd213ps {rn-sae}, %zmm12, %zmm14, %zmm15
+        vfmadd213ps {rn-sae}, %zmm14, %zmm13, %zmm15
+        vscalefps {rn-sae}, %zmm2, %zmm15, %zmm7
+
+/* polynomial evaluation */
+        vmulps {rn-sae}, %zmm7, %zmm7, %zmm8
+        vmulps {rn-sae}, %zmm7, %zmm8, %zmm6
+        vfmadd231ps {rn-sae}, %zmm8, %zmm1, %zmm4
+        vfmadd213ps {rn-sae}, %zmm5, %zmm4, %zmm8
+        vfmadd213ps {rn-sae}, %zmm7, %zmm6, %zmm8
+        vaddps {rn-sae}, %zmm9, %zmm8, %zmm10
+        vxorps %zmm0, %zmm10, %zmm0
+        ret
+
+END(_ZGVeN16v_atanf_skx)
+
+        .section .rodata, "a"
+        .align 64
+
+#ifdef __svml_satan_data_internal_avx512_typedef
+typedef unsigned int VUINT32;
+typedef struct {
+        __declspec(align(64)) VUINT32 AbsMask[16][1];
+        __declspec(align(64)) VUINT32 Shifter[16][1];
+        __declspec(align(64)) VUINT32 MaxThreshold[16][1];
+        __declspec(align(64)) VUINT32 MOne[16][1];
+        __declspec(align(64)) VUINT32 One[16][1];
+        __declspec(align(64)) VUINT32 LargeX[16][1];
+        __declspec(align(64)) VUINT32 Zero[16][1];
+        __declspec(align(64)) VUINT32 Tbl_H[32][1];
+        __declspec(align(64)) VUINT32 Pi2[16][1];
+        __declspec(align(64)) VUINT32 coeff[3][16][1];
+} __svml_satan_data_internal_avx512;
+#endif
+__svml_satan_data_internal_avx512:
+        /*== AbsMask ==*/
+        .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff
+        /*== Shifter ==*/
+        .align 64
+        .long 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000, 0x4a000000
+        /*== MaxThreshold ==*/
+        .align 64
+        .long 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000, 0x40F80000
+        /*== MOne ==*/
+        .align 64
+        .long 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000
+        /*== One ==*/
+        .align 64
+        .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000
+        /*== LargeX ==*/
+        .align 64
+        .long 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000, 0x4f800000
+        /*== Zero ==*/
+        .align 64
+        .long 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
+        /*== Tbl_H ==*/
+        .align 64
+        .long 0x00000000, 0x3e7adbb0
+        .long 0x3eed6338, 0x3f24bc7d
+        .long 0x3f490fdb, 0x3f6563e3
+        .long 0x3f7b985f, 0x3f869c79
+        .long 0x3f8db70d, 0x3f93877b
+        .long 0x3f985b6c, 0x3f9c6b53
+        .long 0x3f9fe0bb, 0x3fa2daa4
+        .long 0x3fa57088, 0x3fa7b46f
+        .long 0x3fa9b465, 0x3fab7b7a
+        .long 0x3fad1283, 0x3fae809e
+        .long 0x3fafcb99, 0x3fb0f836
+        .long 0x3fb20a6a, 0x3fb30581
+        .long 0x3fb3ec43, 0x3fb4c10a
+        .long 0x3fb585d7, 0x3fb63c64
+        .long 0x3fb6e62c, 0x3fb78478
+        .long 0x3fb81868, 0x3fb8a2f5
+        /*== Pi2 ==*/
+        .align 64
+        .long 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB
+        /*== coeff3 ==*/
+        .align 64
+        .long 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de, 0xbe0fa8de
+        .long 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2, 0x3e4cc8e2
+        .long 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa, 0xbeaaaaaa
+        .align 64
+        .type __svml_satan_data_internal_avx512,@object
+        .size __svml_satan_data_internal_avx512,.-__svml_satan_data_internal_avx512
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core-sse2.S
new file mode 100644
index 0000000000..fe81170666
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core-sse2.S
@@ -0,0 +1,20 @@
+/* SSE2 version of vectorized atanf, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.
 */
+
+#define _ZGVbN4v_atanf _ZGVbN4v_atanf_sse2
+#include "../svml_s_atanf4_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core.c
new file mode 100644
index 0000000000..975ece6812
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core.c
@@ -0,0 +1,28 @@
+/* Multiple versions of vectorized atanf, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVbN4v_atanf
+#include "ifunc-mathvec-sse4_1.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVbN4v_atanf, __GI__ZGVbN4v_atanf,
+               __redirect__ZGVbN4v_atanf)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S
new file mode 100644
index 0000000000..c58a894e10
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf4_core_sse4.S
@@ -0,0 +1,164 @@
+/* Function atanf vectorized with SSE4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x)
+ * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x)
+ * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x)
+ * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x)
+ * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x
+ * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/16.0.
+ *
+ */
+
+/* Offsets for data table __svml_satan_data_internal
+ */
+#define _sSIGN_MASK 0
+#define _sABS_MASK 16
+#define _sONE 32
+#define _sPIO2 48
+#define _sPC8 64
+#define _sPC7 80
+#define _sPC6 96
+#define _sPC5 112
+#define _sPC4 128
+#define _sPC3 144
+#define _sPC2 160
+#define _sPC1 176
+#define _sPC0 192
+
+#include <sysdep.h>
+
+        .text
+        .section .text.sse4,"ax",@progbits
+ENTRY(_ZGVbN4v_atanf_sse4)
+/*
+ * To use minps\maxps operations for argument reduction
+ * uncomment _AT_USEMINMAX_ definition
+ * Declarations
+ * Variables
+ * Constants
+ */
+        movups _sABS_MASK+__svml_satan_data_internal(%rip), %xmm2
+
+/*
+ * 1) If x>1, then r=-1/x, PIO2=Pi/2
+ * 2) If -1<=x<=1, then r=x, PIO2=0
+ * 3) If x<-1, then r=-1/x, PIO2=-Pi/2
+ */
+        movups _sONE+__svml_satan_data_internal(%rip), %xmm1
+        andps %xmm0, %xmm2
+        movaps %xmm2, %xmm9
+        movaps %xmm1, %xmm3
+        cmpleps %xmm1, %xmm9
+        maxps %xmm2, %xmm3
+        minps %xmm2, %xmm1
+        divps %xmm3, %xmm1
+        movups __svml_satan_data_internal(%rip), %xmm4
+        movaps %xmm9, %xmm10
+        andps %xmm4, %xmm0
+        andnps %xmm4, %xmm9
+        pxor %xmm0, %xmm9
+        pxor %xmm1, %xmm9
+
+/* Polynomial. */
+        movaps %xmm9, %xmm8
+        mulps %xmm9, %xmm8
+        movaps %xmm8, %xmm7
+        mulps %xmm8, %xmm7
+        movups _sPC8+__svml_satan_data_internal(%rip), %xmm6
+        mulps %xmm7, %xmm6
+        movups _sPC7+__svml_satan_data_internal(%rip), %xmm5
+        mulps %xmm7, %xmm5
+        addps _sPC6+__svml_satan_data_internal(%rip), %xmm6
+        mulps %xmm7, %xmm6
+        addps _sPC5+__svml_satan_data_internal(%rip), %xmm5
+        mulps %xmm7, %xmm5
+        addps _sPC4+__svml_satan_data_internal(%rip), %xmm6
+        mulps %xmm7, %xmm6
+        addps _sPC3+__svml_satan_data_internal(%rip), %xmm5
+        mulps %xmm5, %xmm7
+        addps _sPC2+__svml_satan_data_internal(%rip), %xmm6
+        mulps %xmm8, %xmm6
+        addps _sPC1+__svml_satan_data_internal(%rip), %xmm7
+        andnps _sPIO2+__svml_satan_data_internal(%rip), %xmm10
+        addps %xmm6, %xmm7
+        mulps %xmm7, %xmm8
+        pxor %xmm0, %xmm10
+        addps _sPC0+__svml_satan_data_internal(%rip), %xmm8
+
+/* Reconstruction. */
+        mulps %xmm8, %xmm9
+        addps %xmm9, %xmm10
+        movaps %xmm10, %xmm0
+        ret
+
+END(_ZGVbN4v_atanf_sse4)
+
+        .section .rodata, "a"
+        .align 16
+
+#ifdef __svml_satan_data_internal_typedef
+typedef unsigned int VUINT32;
+typedef struct {
+        __declspec(align(16)) VUINT32 _sSIGN_MASK[4][1];
+        __declspec(align(16)) VUINT32 _sABS_MASK[4][1];
+        __declspec(align(16)) VUINT32 _sONE[4][1];
+        __declspec(align(16)) VUINT32 _sPIO2[4][1];
+        __declspec(align(16)) VUINT32 _sPC8[4][1];
+        __declspec(align(16)) VUINT32 _sPC7[4][1];
+        __declspec(align(16)) VUINT32 _sPC6[4][1];
+        __declspec(align(16)) VUINT32 _sPC5[4][1];
+        __declspec(align(16)) VUINT32 _sPC4[4][1];
+        __declspec(align(16)) VUINT32 _sPC3[4][1];
+        __declspec(align(16)) VUINT32 _sPC2[4][1];
+        __declspec(align(16)) VUINT32 _sPC1[4][1];
+        __declspec(align(16)) VUINT32 _sPC0[4][1];
+} __svml_satan_data_internal;
+#endif
+__svml_satan_data_internal:
+        .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 //_sSIGN_MASK
+        .align 16
+        .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF //_sABS_MASK
+        .align 16
+        .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 //_sONE
+        .align 16
+        .long 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB //_sPIO2
+        .align 16
+        .long 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0 //_sPC8
+        .align 16
+        .long 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631 //_sPC7
+        .align 16
+        .long 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384 //_sPC6
+        .align 16
+        .long 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629 //_sPC5
+        .align 16
+        .long 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474 //_sPC4
+        .align 16
+        .long 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8 //_sPC3
+        .align 16
+        .long 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F //_sPC2
+        .align 16
+        .long 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49 //_sPC1
+        .align 16
+        .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 //_sPC0
+        .align 16
+        .type __svml_satan_data_internal,@object
+        .size __svml_satan_data_internal,.-__svml_satan_data_internal
diff
--git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core-sse.S
new file mode 100644
index 0000000000..1652a8f5c6
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core-sse.S
@@ -0,0 +1,20 @@
+/* SSE version of vectorized atanf, vector length is 8.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#define _ZGVdN8v_atanf _ZGVdN8v_atanf_sse_wrapper
+#include "../svml_s_atanf8_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core.c
new file mode 100644
index 0000000000..733d8c3bc3
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core.c
@@ -0,0 +1,28 @@
+/* Multiple versions of vectorized atanf, vector length is 8.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#define SYMBOL_NAME _ZGVdN8v_atanf
+#include "ifunc-mathvec-avx2.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVdN8v_atanf, __GI__ZGVdN8v_atanf,
+ __redirect__ZGVdN8v_atanf)
+ __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core_avx2.S
new file mode 100644
index 0000000000..e333f979c4
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanf8_core_avx2.S
@@ -0,0 +1,148 @@
+/* Function atanf vectorized with AVX2.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ https://www.gnu.org/licenses/.
*/
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x)
+ * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x)
+ * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x)
+ * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x)
+ * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x
+ * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16.
+ *
+ */
+
+/* Offsets for data table __svml_satan_data_internal
+ */
+#define _sSIGN_MASK 0
+#define _sABS_MASK 32
+#define _sONE 64
+#define _sPIO2 96
+#define _sPC8 128
+#define _sPC7 160
+#define _sPC6 192
+#define _sPC5 224
+#define _sPC4 256
+#define _sPC3 288
+#define _sPC2 320
+#define _sPC1 352
+#define _sPC0 384
+
+#include <sysdep.h>
+
+	.text
+	.section .text.avx2,"ax",@progbits
+ENTRY(_ZGVdN8v_atanf_avx2)
+/*
+ * 1) If x>1, then r=-1/x, PIO2=Pi/2
+ * 2) If -1<=x<=1, then r=x, PIO2=0
+ * 3) If x<-1, then r=-1/x, PIO2=-Pi/2
+ */
+	vmovups _sONE+__svml_satan_data_internal(%rip), %ymm2
+	vmovups __svml_satan_data_internal(%rip), %ymm7
+	vmovups _sPC7+__svml_satan_data_internal(%rip), %ymm13
+
+/*
+ * To use minps\maxps operations for argument reduction
+ * uncomment _AT_USEMINMAX_ definition
+ * Declarations
+ * Variables
+ * Constants
+ */
+	vandps _sABS_MASK+__svml_satan_data_internal(%rip), %ymm0, %ymm3
+	vmaxps %ymm3, %ymm2, %ymm5
+	vminps %ymm3, %ymm2, %ymm4
+	vcmple_oqps %ymm2, %ymm3, %ymm6
+	vdivps %ymm5, %ymm4, %ymm11
+	vandps %ymm7, %ymm0, %ymm9
+	vandnps %ymm7, %ymm6, %ymm8
+	vxorps %ymm9, %ymm8, %ymm10
+	vxorps %ymm11, %ymm10, %ymm15
+
+/* Polynomial.
*/ + vmulps %ymm15, %ymm15, %ymm14 + vmovups _sPC8+__svml_satan_data_internal(%rip), %ymm0 + vmulps %ymm14, %ymm14, %ymm12 + vfmadd213ps _sPC6+__svml_satan_data_internal(%rip), %ymm12, %ymm0 + vfmadd213ps _sPC5+__svml_satan_data_internal(%rip), %ymm12, %ymm13 + vfmadd213ps _sPC4+__svml_satan_data_internal(%rip), %ymm12, %ymm0 + vfmadd213ps _sPC3+__svml_satan_data_internal(%rip), %ymm12, %ymm13 + vfmadd213ps _sPC2+__svml_satan_data_internal(%rip), %ymm12, %ymm0 + vfmadd213ps _sPC1+__svml_satan_data_internal(%rip), %ymm12, %ymm13 + vfmadd213ps %ymm13, %ymm14, %ymm0 + vfmadd213ps _sPC0+__svml_satan_data_internal(%rip), %ymm14, %ymm0 + vandnps _sPIO2+__svml_satan_data_internal(%rip), %ymm6, %ymm1 + vxorps %ymm9, %ymm1, %ymm1 + +/* Reconstruction. */ + vfmadd213ps %ymm1, %ymm15, %ymm0 + ret + +END(_ZGVdN8v_atanf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_satan_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 _sSIGN_MASK[8][1]; + __declspec(align(32)) VUINT32 _sABS_MASK[8][1]; + __declspec(align(32)) VUINT32 _sONE[8][1]; + __declspec(align(32)) VUINT32 _sPIO2[8][1]; + __declspec(align(32)) VUINT32 _sPC8[8][1]; + __declspec(align(32)) VUINT32 _sPC7[8][1]; + __declspec(align(32)) VUINT32 _sPC6[8][1]; + __declspec(align(32)) VUINT32 _sPC5[8][1]; + __declspec(align(32)) VUINT32 _sPC4[8][1]; + __declspec(align(32)) VUINT32 _sPC3[8][1]; + __declspec(align(32)) VUINT32 _sPC2[8][1]; + __declspec(align(32)) VUINT32 _sPC1[8][1]; + __declspec(align(32)) VUINT32 _sPC0[8][1]; +} __svml_satan_data_internal; +#endif +__svml_satan_data_internal: + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 //_sSIGN_MASK + .align 32 + .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF //_sABS_MASK + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 //_sONE + .align 
32 + .long 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB //_sPIO2 + .align 32 + .long 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0 //_sPC8 + .align 32 + .long 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631 //_sPC7 + .align 32 + .long 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384 //_sPC6 + .align 32 + .long 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629 //_sPC5 + .align 32 + .long 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474 //_sPC4 + .align 32 + .long 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8 //_sPC3 + .align 32 + .long 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F //_sPC2 + .align 32 + .long 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49 //_sPC1 + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 //_sPC0 + .align 32 + .type __svml_satan_data_internal,@object + .size __svml_satan_data_internal,.-__svml_satan_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_atan2_core.S b/sysdeps/x86_64/fpu/svml_d_atan2_core.S new file mode 100644 index 0000000000..e86d5b7047 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atan2_core.S @@ -0,0 +1,29 @@ +/* Function atan vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVbN2v_atan)
+WRAPPER_IMPL_SSE2 atan
+END (_ZGVbN2v_atan)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVbN2v_atan)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_atan4_core.S b/sysdeps/x86_64/fpu/svml_d_atan4_core.S
new file mode 100644
index 0000000000..eb11fd2f17
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_atan4_core.S
@@ -0,0 +1,29 @@
+/* Function atan vectorized with AVX2, wrapper version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>.
*/
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVdN4v_atan)
+WRAPPER_IMPL_AVX _ZGVbN2v_atan
+END (_ZGVdN4v_atan)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVdN4v_atan)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_atan4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_atan4_core_avx.S
new file mode 100644
index 0000000000..b83a4be33d
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_atan4_core_avx.S
@@ -0,0 +1,25 @@
+/* Function atan vectorized in AVX ISA as wrapper to SSE4 ISA version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVcN4v_atan)
+WRAPPER_IMPL_AVX _ZGVbN2v_atan
+END (_ZGVcN4v_atan)
diff --git a/sysdeps/x86_64/fpu/svml_d_atan8_core.S b/sysdeps/x86_64/fpu/svml_d_atan8_core.S
new file mode 100644
index 0000000000..9685a32bdc
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_atan8_core.S
@@ -0,0 +1,25 @@
+/* Function atan vectorized with AVX-512, wrapper to AVX2.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVeN8v_atan)
+WRAPPER_IMPL_AVX512 _ZGVdN4v_atan
+END (_ZGVeN8v_atan)
diff --git a/sysdeps/x86_64/fpu/svml_s_atanf16_core.S b/sysdeps/x86_64/fpu/svml_s_atanf16_core.S
new file mode 100644
index 0000000000..f82d2422ae
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_atanf16_core.S
@@ -0,0 +1,25 @@
+/* Function atanf vectorized with AVX-512. Wrapper to AVX2 version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>.
*/
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVeN16v_atanf)
+WRAPPER_IMPL_AVX512 _ZGVdN8v_atanf
+END (_ZGVeN16v_atanf)
diff --git a/sysdeps/x86_64/fpu/svml_s_atanf4_core.S b/sysdeps/x86_64/fpu/svml_s_atanf4_core.S
new file mode 100644
index 0000000000..6b8c4d9624
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_atanf4_core.S
@@ -0,0 +1,29 @@
+/* Function atanf vectorized with SSE2, wrapper version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVbN4v_atanf)
+WRAPPER_IMPL_SSE2 atanf
+END (_ZGVbN4v_atanf)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVbN4v_atanf)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_atanf8_core.S b/sysdeps/x86_64/fpu/svml_s_atanf8_core.S
new file mode 100644
index 0000000000..315681f6c0
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_atanf8_core.S
@@ -0,0 +1,29 @@
+/* Function atanf vectorized with AVX2, wrapper version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVdN8v_atanf)
+WRAPPER_IMPL_AVX _ZGVbN4v_atanf
+END (_ZGVdN8v_atanf)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVdN8v_atanf)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_atanf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_atanf8_core_avx.S
new file mode 100644
index 0000000000..b9cd502186
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_atanf8_core_avx.S
@@ -0,0 +1,25 @@
+/* Function atanf vectorized in AVX ISA as wrapper to SSE4 ISA version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>.
*/
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVcN8v_atanf)
+WRAPPER_IMPL_AVX _ZGVbN4v_atanf
+END (_ZGVcN8v_atanf)
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx.c
new file mode 100644
index 0000000000..0f7176a20b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-atan.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx2.c
new file mode 100644
index 0000000000..0f7176a20b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx2.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-atan.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx512f.c
new file mode 100644
index 0000000000..0f7176a20b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan-avx512f.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-atan.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan.c
new file mode 100644
index 0000000000..982687b169
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan.c
@@ -0,0 +1,3 @@
+#define LIBMVEC_TYPE double
+#define LIBMVEC_FUNC atan
+#include "test-vector-abi-arg1.h"
diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
index 0abc7d2021..467c913990 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
@@ -28,6 +28,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log), _ZGVbN2v_log)
 VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVbN2v_exp)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVbN2vv_pow)
 VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVbN2v_acos)
+VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVbN2v_atan)
 
 #define VEC_INT_TYPE __m128i
diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index dda093b914..b72a7de84e 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log), _ZGVdN4v_log) VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVdN4v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVdN4vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVdN4v_acos) +VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVdN4v_atan) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index f3230463bb..d2434df21e 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -28,6 +28,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log), _ZGVcN4v_log) VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVcN4v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVcN4vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVcN4v_acos) +VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVcN4v_atan) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index cf9f52faf0..f7aaf8159e 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -28,6 +28,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log), _ZGVeN8v_log) VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVeN8v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVeN8vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVeN8v_acos) +VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVeN8v_atan) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx.c new file mode 100644 index 0000000000..9251c65f8a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atanf.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx2.c new file mode 100644 index 0000000000..9251c65f8a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atanf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx512f.c new file mode 100644 index 0000000000..9251c65f8a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atanf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanf.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanf.c new file mode 100644 index 0000000000..2a8ab87e86 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC atanf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index abbd3ed870..af769c56fa 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -28,6 +28,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (logf), _ZGVeN16v_logf) VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVeN16v_expf) VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVeN16vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVeN16v_acosf) +VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVeN16v_atanf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 8a24027952..76e61d2f1e 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -28,6 +28,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (logf), _ZGVbN4v_logf) VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVbN4v_expf) VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVbN4vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVbN4v_acosf) +VECTOR_WRAPPER (WRAPPER_NAME 
(atanf), _ZGVbN4v_atanf)
 
 #define VEC_INT_TYPE __m128i
diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c
index aff0442606..5e27eaaf29 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c
@@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (logf), _ZGVdN8v_logf)
 VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVdN8v_expf)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVdN8vv_powf)
 VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVdN8v_acosf)
+VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVdN8v_atanf)
 
 /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */
 #undef VECTOR_WRAPPER_fFF
diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c
index 913584d111..28daf79aa9 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c
@@ -28,6 +28,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (logf), _ZGVcN8v_logf)
 VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVcN8v_expf)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVcN8vv_powf)
 VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVcN8v_acosf)
+VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVcN8v_atanf)
 
 #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:14 2021
To: libc-alpha@sourceware.org
Subject: [PATCH v4 02/18] x86-64: Add vector asin/asinf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:14 -0800
Message-Id: <20211228201130.737370-3-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
From: Sunil Pandey
Reply-To: Sunil K Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com

Implement vectorized asin/asinf containing SSE, AVX, AVX2 and AVX512
versions for libmvec as per vector ABI.  It also contains accuracy and
ABI tests for vector asin/asinf with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 ++ .../fpu/multiarch/svml_d_asin2_core-sse2.S | 20 ++ .../x86_64/fpu/multiarch/svml_d_asin2_core.c | 27 ++ .../fpu/multiarch/svml_d_asin2_core_sse4.S | 288 +++++++++++++++++ .../fpu/multiarch/svml_d_asin4_core-sse.S | 20 ++ .../x86_64/fpu/multiarch/svml_d_asin4_core.c | 27 ++ .../fpu/multiarch/svml_d_asin4_core_avx2.S | 273 ++++++++++++++++ .../fpu/multiarch/svml_d_asin8_core-avx2.S | 20 ++ .../x86_64/fpu/multiarch/svml_d_asin8_core.c | 27 ++ .../fpu/multiarch/svml_d_asin8_core_avx512.S | 295 ++++++++++++++++++ .../fpu/multiarch/svml_s_asinf16_core-avx2.S | 20 ++ .../fpu/multiarch/svml_s_asinf16_core.c | 28 ++ .../multiarch/svml_s_asinf16_core_avx512.S | 260 +++++++++++++++ .../fpu/multiarch/svml_s_asinf4_core-sse2.S | 20 ++ .../x86_64/fpu/multiarch/svml_s_asinf4_core.c | 28 ++ .../fpu/multiarch/svml_s_asinf4_core_sse4.S | 252 +++++++++++++++ .../fpu/multiarch/svml_s_asinf8_core-sse.S | 20 ++ .../x86_64/fpu/multiarch/svml_s_asinf8_core.c | 28 ++ .../fpu/multiarch/svml_s_asinf8_core_avx2.S | 249 +++++++++++++++ sysdeps/x86_64/fpu/svml_d_asin2_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_asin4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_asin4_core_avx.S | 25 ++ sysdeps/x86_64/fpu/svml_d_asin8_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_asinf16_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_asinf4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_asinf8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_asinf8_core_avx.S | 25 ++ .../x86_64/fpu/test-double-libmvec-asin-avx.c | 1 + .../fpu/test-double-libmvec-asin-avx2.c | 1 + .../fpu/test-double-libmvec-asin-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-asin.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-asinf-avx.c | 1 + .../fpu/test-float-libmvec-asinf-avx2.c | 1 + .../fpu/test-float-libmvec-asinf-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-asinf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2189 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asin2_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_asin4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asin4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asin8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asin-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asin-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asin-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asin.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index b4647ca918..ae8ee882d0 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -120,4 +120,15 @@ #define __DECL_SIMD_atanf32x #define __DECL_SIMD_atanf64x #define __DECL_SIMD_atanf128x + +#define __DECL_SIMD_asin +#define __DECL_SIMD_asinf +#define __DECL_SIMD_asinl +#define __DECL_SIMD_asinf16 +#define __DECL_SIMD_asinf32 +#define __DECL_SIMD_asinf64 +#define __DECL_SIMD_asinf128 +#define __DECL_SIMD_asinf32x +#define __DECL_SIMD_asinf64x +#define __DECL_SIMD_asinf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 3e27c21f21..bb53b7021e 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -52,7 +52,7 @@ /* Arc cosine of X. */ __MATHCALL_VEC (acos,, (_Mdouble_ __x)); /* Arc sine of X. */ -__MATHCALL (asin,, (_Mdouble_ __x)); +__MATHCALL_VEC (asin,, (_Mdouble_ __x)); /* Arc tangent of X. */ __MATHCALL_VEC (atan,, (_Mdouble_ __x)); /* Arc tangent of Y/X. 
*/ diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index a93258db6f..ab03a07f92 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -47,18 +47,26 @@ GLIBC_2.22 _ZGVeN8v_sin F GLIBC_2.22 _ZGVeN8vv_pow F GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F +GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F GLIBC_2.35 _ZGVbN4v_acosf F +GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F GLIBC_2.35 _ZGVcN4v_acos F +GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F GLIBC_2.35 _ZGVcN8v_acosf F +GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F GLIBC_2.35 _ZGVdN4v_acos F +GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F GLIBC_2.35 _ZGVdN8v_acosf F +GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F GLIBC_2.35 _ZGVeN16v_acosf F +GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F GLIBC_2.35 _ZGVeN8v_acos F +GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 1c0e5c5e35..73cb8849ff 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -66,6 +66,10 @@ # define __DECL_SIMD_atan __DECL_SIMD_x86_64 # undef __DECL_SIMD_atanf # define __DECL_SIMD_atanf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_asin +# define __DECL_SIMD_asin __DECL_SIMD_x86_64 +# undef __DECL_SIMD_asinf +# define __DECL_SIMD_asinf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index ddcccb11d7..4552c2bdfa 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -32,6 +32,8 @@ !GCC$ builtin (acosf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (atan) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (atanf) attributes simd (notinbranch) if('x86_64') +!GCC$ 
builtin (asin) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (asinf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -49,3 +51,5 @@ !GCC$ builtin (acosf) attributes simd (notinbranch) if('x32') !GCC$ builtin (atan) attributes simd (notinbranch) if('x32') !GCC$ builtin (atanf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (asin) attributes simd (notinbranch) if('x32') +!GCC$ builtin (asinf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index dae0887f13..e0eae0b196 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -23,6 +23,7 @@ postclean-generated += libmvec.mk # Define for both math and mathvec directories. libmvec-funcs = \ acos \ + asin \ atan \ cos \ exp \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 424f6d526e..10baf869a5 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -15,8 +15,10 @@ libmvec { } GLIBC_2.35 { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; + _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; + _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; } } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 2e64e59803..ea0f833381 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -93,6 +93,26 @@ float: 1 float128: 2 ldouble: 1 +Function: "asin_vlen16": +float: 1 + +Function: "asin_vlen2": +double: 1 + +Function: "asin_vlen4": +double: 1 +float: 1 + +Function: "asin_vlen4_avx2": +double: 1 + +Function: "asin_vlen8": +double: 1 +float: 1 + +Function: 
"asin_vlen8_avx2": +float: 1 + Function: "asinh": +double: 2 +float: 2 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core-sse2.S new file mode 100644 index 0000000000..57e1d41a7b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized asin, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVbN2v_asin _ZGVbN2v_asin_sse2 +#include "../svml_d_asin2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core.c new file mode 100644 index 0000000000..e46c3af81e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized asin, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVbN2v_asin +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_asin, __GI__ZGVbN2v_asin, __redirect__ZGVbN2v_asin) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core_sse4.S new file mode 100644 index 0000000000..a6f7a41623 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin2_core_sse4.S @@ -0,0 +1,288 @@ +/* Function asin vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * SelMask = (|x| >= 0.5) ? 1 : 0; + * R = SelMask ? sqrt(0.5 - 0.5*|x|) : |x| + * asin(x) = (SelMask ? 
(Pi/2 - 2*Poly(R)) : Poly(R))*(-1)^sign(x) + * + */ + +/* Offsets for data table __svml_dasin_data_internal + */ +#define AbsMask 0 +#define OneHalf 16 +#define SmallNorm 32 +#define One 48 +#define Two 64 +#define sqrt_coeff 80 +#define poly_coeff 144 +#define Pi2H 336 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_asin_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm5 + movups __svml_dasin_data_internal(%rip), %xmm3 + movups OneHalf+__svml_dasin_data_internal(%rip), %xmm8 + +/* x = |arg| */ + movaps %xmm3, %xmm4 + andps %xmm5, %xmm4 + +/* Y = 0.5 - 0.5*x */ + movaps %xmm8, %xmm6 + mulpd %xmm4, %xmm6 + movaps %xmm8, %xmm14 + +/* x^2 */ + movaps %xmm4, %xmm2 + subpd %xmm6, %xmm14 + mulpd %xmm4, %xmm2 + +/* S ~ -2*sqrt(Y) */ + cvtpd2ps %xmm14, %xmm9 + minpd %xmm14, %xmm2 + movlhps %xmm9, %xmm9 + movaps %xmm14, %xmm15 + rsqrtps %xmm9, %xmm10 + cmpltpd SmallNorm+__svml_dasin_data_internal(%rip), %xmm15 + addpd %xmm14, %xmm14 + cvtps2pd %xmm10, %xmm11 + andnps %xmm11, %xmm15 + movaps %xmm4, %xmm1 + movaps %xmm15, %xmm12 + andnps %xmm5, %xmm3 + mulpd %xmm15, %xmm12 + mulpd %xmm14, %xmm15 + mulpd %xmm12, %xmm14 + cmpnltpd %xmm8, %xmm1 + subpd Two+__svml_dasin_data_internal(%rip), %xmm14 + +/* polynomial */ + movups poly_coeff+__svml_dasin_data_internal(%rip), %xmm6 + movaps %xmm2, %xmm12 + mulpd %xmm2, %xmm6 + mulpd %xmm2, %xmm12 + addpd poly_coeff+16+__svml_dasin_data_internal(%rip), %xmm6 + movups One+__svml_dasin_data_internal(%rip), %xmm7 + movaps %xmm12, %xmm8 + cmpltpd %xmm4, %xmm7 + mulpd %xmm12, %xmm6 + movmskpd %xmm7, %edx + movups poly_coeff+32+__svml_dasin_data_internal(%rip), %xmm9 + movaps %xmm14, %xmm0 + movups poly_coeff+64+__svml_dasin_data_internal(%rip), %xmm7 + mulpd %xmm2, %xmm9 + mulpd %xmm2, %xmm7 + addpd poly_coeff+48+__svml_dasin_data_internal(%rip), %xmm9 + addpd poly_coeff+80+__svml_dasin_data_internal(%rip), %xmm7 + mulpd %xmm12, %xmm8 + mulpd %xmm12, %xmm7 + addpd %xmm6, %xmm9 + mulpd %xmm15, 
%xmm0 + mulpd %xmm8, %xmm9 + movups poly_coeff+96+__svml_dasin_data_internal(%rip), %xmm10 + mulpd %xmm2, %xmm10 + movups sqrt_coeff+__svml_dasin_data_internal(%rip), %xmm13 + mulpd %xmm14, %xmm13 + addpd poly_coeff+112+__svml_dasin_data_internal(%rip), %xmm10 + addpd sqrt_coeff+16+__svml_dasin_data_internal(%rip), %xmm13 + addpd %xmm7, %xmm10 + mulpd %xmm14, %xmm13 + addpd %xmm9, %xmm10 + addpd sqrt_coeff+32+__svml_dasin_data_internal(%rip), %xmm13 + mulpd %xmm12, %xmm10 + mulpd %xmm13, %xmm14 + movups poly_coeff+128+__svml_dasin_data_internal(%rip), %xmm11 + mulpd %xmm2, %xmm11 + addpd sqrt_coeff+48+__svml_dasin_data_internal(%rip), %xmm14 + addpd poly_coeff+144+__svml_dasin_data_internal(%rip), %xmm11 + mulpd %xmm14, %xmm0 + addpd %xmm10, %xmm11 + subpd %xmm15, %xmm0 + mulpd %xmm11, %xmm12 + movups poly_coeff+160+__svml_dasin_data_internal(%rip), %xmm13 + movaps %xmm1, %xmm14 + mulpd %xmm2, %xmm13 + addpd poly_coeff+176+__svml_dasin_data_internal(%rip), %xmm13 + addpd %xmm12, %xmm13 + mulpd %xmm13, %xmm2 + andnps %xmm4, %xmm14 + andps %xmm1, %xmm0 + orps %xmm0, %xmm14 + mulpd %xmm14, %xmm2 + addpd %xmm2, %xmm14 + movups Pi2H+__svml_dasin_data_internal(%rip), %xmm0 + andps %xmm1, %xmm0 + addpd %xmm14, %xmm0 + pxor %xmm3, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm5 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm5, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* 
Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call asin@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_asin_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dasin_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 AbsMask[2][2]; + __declspec(align(16)) VUINT32 OneHalf[2][2]; + __declspec(align(16)) VUINT32 SmallNorm[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 Two[2][2]; + __declspec(align(16)) VUINT32 sqrt_coeff[4][2][2]; + __declspec(align(16)) VUINT32 poly_coeff[12][2][2]; + __declspec(align(16)) VUINT32 Pi2H[2][2]; +} __svml_dasin_data_internal; +#endif +__svml_dasin_data_internal: + /*== AbsMask ==*/ + .quad 0x7fffffffffffffff, 0x7fffffffffffffff + /*== OneHalf ==*/ + .align 16 + .quad 0x3fe0000000000000, 0x3fe0000000000000 + /*== SmallNorm ==*/ + .align 16 + .quad 0x3000000000000000, 0x3000000000000000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== Two ==*/ + .align 16 + .quad 0x4000000000000000, 0x4000000000000000 + /*== sqrt_coeff[4] ==*/ + .align 16 + .quad 0xbf918000993B24C3, 0xbf918000993B24C3 /* sqrt_coeff4 */ + .quad 0x3fa400006F70D42D, 
0x3fa400006F70D42D /* sqrt_coeff3 */ + .quad 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97 /* sqrt_coeff2 */ + .quad 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D /* sqrt_coeff1 */ + /*== poly_coeff[12] ==*/ + .align 16 + .quad 0x3fa07520C70EB909, 0x3fa07520C70EB909 /* poly_coeff12 */ + .quad 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED /* poly_coeff11 */ + .quad 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE /* poly_coeff10 */ + .quad 0x3f7A583395D45ED5, 0x3f7A583395D45ED5 /* poly_coeff9 */ + .quad 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6 /* poly_coeff8 */ + .quad 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57 /* poly_coeff7 */ + .quad 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E /* poly_coeff6 */ + .quad 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd /* poly_coeff5 */ + .quad 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE /* poly_coeff4 */ + .quad 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8 /* poly_coeff3 */ + .quad 0x3fb333333337E0DE, 0x3fb333333337E0DE /* poly_coeff2 */ + .quad 0x3fc555555555529C, 0x3fc555555555529C /* poly_coeff1 */ + /*== Pi2H ==*/ + .align 16 + .quad 0x3ff921fb54442d18, 0x3ff921fb54442d18 + .align 16 + .type __svml_dasin_data_internal,@object + .size __svml_dasin_data_internal,.-__svml_dasin_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core-sse.S new file mode 100644 index 0000000000..1006fddc59 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized asin, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN4v_asin _ZGVdN4v_asin_sse_wrapper +#include "../svml_d_asin4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core.c new file mode 100644 index 0000000000..b896516f5e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized asin, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#define SYMBOL_NAME _ZGVdN4v_asin +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_asin, __GI__ZGVdN4v_asin, __redirect__ZGVdN4v_asin) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core_avx2.S new file mode 100644 index 0000000000..80467b616f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin4_core_avx2.S @@ -0,0 +1,273 @@ +/* Function asin vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * SelMask = (|x| >= 0.5) ? 1 : 0; + * R = SelMask ? sqrt(0.5 - 0.5*|x|) : |x| + * asin(x) = (SelMask ? 
(Pi/2 - 2*Poly(R)) : Poly(R))*(-1)^sign(x) + * + */ + +/* Offsets for data table __svml_dasin_data_internal + */ +#define AbsMask 0 +#define OneHalf 32 +#define SmallNorm 64 +#define One 96 +#define Two 128 +#define sqrt_coeff 160 +#define poly_coeff 288 +#define Pi2H 672 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_asin_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovupd __svml_dasin_data_internal(%rip), %ymm6 + vmovupd OneHalf+__svml_dasin_data_internal(%rip), %ymm10 + vmovupd One+__svml_dasin_data_internal(%rip), %ymm8 + vmovapd %ymm0, %ymm5 + +/* x = |arg| */ + vandpd %ymm5, %ymm6, %ymm4 + +/* Y = 0.5 - 0.5*x */ + vmovapd %ymm10, %ymm15 + vfnmadd231pd %ymm4, %ymm10, %ymm15 + +/* x^2 */ + vmulpd %ymm4, %ymm4, %ymm7 + vcmplt_oqpd %ymm4, %ymm8, %ymm9 + +/* S ~ -2*sqrt(Y) */ + vcmplt_oqpd SmallNorm+__svml_dasin_data_internal(%rip), %ymm15, %ymm13 + vminpd %ymm15, %ymm7, %ymm2 + vaddpd %ymm15, %ymm15, %ymm7 + vcmpnlt_uqpd %ymm10, %ymm4, %ymm1 + vcvtpd2ps %ymm15, %xmm11 + vmovupd poly_coeff+64+__svml_dasin_data_internal(%rip), %ymm10 + vmulpd %ymm2, %ymm2, %ymm15 + vrsqrtps %xmm11, %xmm12 + vmovupd poly_coeff+192+__svml_dasin_data_internal(%rip), %ymm11 + vfmadd213pd poly_coeff+96+__svml_dasin_data_internal(%rip), %ymm2, %ymm10 + vcvtps2pd %xmm12, %ymm14 + vmulpd %ymm15, %ymm15, %ymm12 + vfmadd213pd poly_coeff+224+__svml_dasin_data_internal(%rip), %ymm2, %ymm11 + vandnpd %ymm14, %ymm13, %ymm0 + vandnpd %ymm5, %ymm6, %ymm3 + vmulpd %ymm0, %ymm0, %ymm6 + vmovupd poly_coeff+128+__svml_dasin_data_internal(%rip), %ymm13 + vmovupd poly_coeff+256+__svml_dasin_data_internal(%rip), %ymm14 + vfmadd213pd poly_coeff+160+__svml_dasin_data_internal(%rip), %ymm2, %ymm13 + vfmadd213pd poly_coeff+288+__svml_dasin_data_internal(%rip), %ymm2, %ymm14 + vfmadd213pd %ymm11, %ymm15, %ymm13 + vmovmskpd %ymm9, %edx + vmulpd %ymm7, %ymm0, %ymm9 + vfmsub213pd 
Two+__svml_dasin_data_internal(%rip), %ymm6, %ymm7 + +/* polynomial */ + vmovupd poly_coeff+__svml_dasin_data_internal(%rip), %ymm6 + vmovupd sqrt_coeff+__svml_dasin_data_internal(%rip), %ymm0 + vmulpd %ymm7, %ymm9, %ymm8 + vfmadd213pd poly_coeff+32+__svml_dasin_data_internal(%rip), %ymm2, %ymm6 + vfmadd213pd sqrt_coeff+32+__svml_dasin_data_internal(%rip), %ymm7, %ymm0 + vfmadd213pd %ymm10, %ymm15, %ymm6 + vmovupd poly_coeff+320+__svml_dasin_data_internal(%rip), %ymm10 + vfmadd213pd sqrt_coeff+64+__svml_dasin_data_internal(%rip), %ymm7, %ymm0 + vfmadd213pd %ymm13, %ymm12, %ymm6 + vfmadd213pd poly_coeff+352+__svml_dasin_data_internal(%rip), %ymm2, %ymm10 + vfmadd213pd sqrt_coeff+96+__svml_dasin_data_internal(%rip), %ymm7, %ymm0 + vfmadd213pd %ymm14, %ymm15, %ymm6 + vfmsub213pd %ymm9, %ymm8, %ymm0 + vfmadd213pd %ymm10, %ymm15, %ymm6 + vblendvpd %ymm1, %ymm0, %ymm4, %ymm4 + vmulpd %ymm6, %ymm2, %ymm2 + vfmadd213pd %ymm4, %ymm4, %ymm2 + vandpd Pi2H+__svml_dasin_data_internal(%rip), %ymm1, %ymm1 + vaddpd %ymm2, %ymm1, %ymm0 + vxorpd %ymm3, %ymm0, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm5 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm5, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; 
DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call asin@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) 
+ +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_asin_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dasin_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 AbsMask[4][2]; + __declspec(align(32)) VUINT32 OneHalf[4][2]; + __declspec(align(32)) VUINT32 SmallNorm[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 Two[4][2]; + __declspec(align(32)) VUINT32 sqrt_coeff[4][4][2]; + __declspec(align(32)) VUINT32 poly_coeff[12][4][2]; + __declspec(align(32)) VUINT32 Pi2H[4][2]; +} __svml_dasin_data_internal; +#endif +__svml_dasin_data_internal: + /*== AbsMask ==*/ + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== OneHalf ==*/ + .align 32 + .quad 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000 + /*== SmallNorm ==*/ + .align 32 + .quad 0x3000000000000000, 0x3000000000000000, 0x3000000000000000, 0x3000000000000000 + /*== One ==*/ + .align 32 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== Two ==*/ + .align 32 + .quad 0x4000000000000000, 0x4000000000000000, 0x4000000000000000, 0x4000000000000000 + /*== sqrt_coeff[4] ==*/ + .align 32 + .quad 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3 /* sqrt_coeff4 */ + .quad 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D /* sqrt_coeff3 */ + .quad 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97 /* sqrt_coeff2 */ + .quad 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D /* sqrt_coeff1 */ + /*== poly_coeff[12] ==*/ + .align 32 + .quad 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909 /* poly_coeff12 */ + .quad 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED /* poly_coeff11 */ + .quad 
0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE /* poly_coeff10 */ + .quad 0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 0x3f7A583395D45ED5 /* poly_coeff9 */ + .quad 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6 /* poly_coeff8 */ + .quad 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57 /* poly_coeff7 */ + .quad 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E /* poly_coeff6 */ + .quad 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd /* poly_coeff5 */ + .quad 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE /* poly_coeff4 */ + .quad 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8 /* poly_coeff3 */ + .quad 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE /* poly_coeff2 */ + .quad 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C /* poly_coeff1 */ + /*== Pi2H ==*/ + .align 32 + .quad 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18 + .align 32 + .type __svml_dasin_data_internal,@object + .size __svml_dasin_data_internal,.-__svml_dasin_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core-avx2.S new file mode 100644 index 0000000000..354a55dfaa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized asin, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN8v_asin _ZGVeN8v_asin_avx2_wrapper +#include "../svml_d_asin8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core.c new file mode 100644 index 0000000000..b03e4a2b9c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized asin, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVeN8v_asin +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_asin, __GI__ZGVeN8v_asin, __redirect__ZGVeN8v_asin) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core_avx512.S new file mode 100644 index 0000000000..b2fd8edb13 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asin8_core_avx512.S @@ -0,0 +1,295 @@ +/* Function asin vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * SelMask = (|x| >= 0.5) ? 1 : 0; + * R = SelMask ? sqrt(0.5 - 0.5*|x|) : |x| + * asin(x) = (SelMask ? 
(Pi/2 - 2*Poly(R)) : Poly(R))*(-1)^sign(x) + * + */ + +/* Offsets for data table __svml_dasin_data_internal + */ +#define AbsMask 0 +#define OneHalf 64 +#define SmallNorm 128 +#define One 192 +#define Two 256 +#define sqrt_coeff_1 320 +#define sqrt_coeff_2 384 +#define sqrt_coeff_3 448 +#define sqrt_coeff_4 512 +#define poly_coeff_1 576 +#define poly_coeff_2 640 +#define poly_coeff_3 704 +#define poly_coeff_4 768 +#define poly_coeff_5 832 +#define poly_coeff_6 896 +#define poly_coeff_7 960 +#define poly_coeff_8 1024 +#define poly_coeff_9 1088 +#define poly_coeff_10 1152 +#define poly_coeff_11 1216 +#define poly_coeff_12 1280 +#define Pi2H 1344 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_asin_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups OneHalf+__svml_dasin_data_internal(%rip), %zmm8 + +/* S ~ -2*sqrt(Y) */ + vmovups SmallNorm+__svml_dasin_data_internal(%rip), %zmm10 + vmovups Two+__svml_dasin_data_internal(%rip), %zmm14 + vmovups sqrt_coeff_1+__svml_dasin_data_internal(%rip), %zmm15 + vmovups sqrt_coeff_2+__svml_dasin_data_internal(%rip), %zmm2 + vmovups sqrt_coeff_3+__svml_dasin_data_internal(%rip), %zmm1 + vmovups One+__svml_dasin_data_internal(%rip), %zmm9 + vmovaps %zmm0, %zmm6 + +/* x = |arg| */ + vandpd __svml_dasin_data_internal(%rip), %zmm6, %zmm4 + +/* Y = 0.5 - 0.5*x */ + vmovaps %zmm8, %zmm11 + vfnmadd231pd {rn-sae}, %zmm4, %zmm8, %zmm11 + +/* x^2 */ + vmulpd {rn-sae}, %zmm4, %zmm4, %zmm7 + vrsqrt14pd %zmm11, %zmm12 + vcmppd $17, {sae}, %zmm10, %zmm11, %k1 + vcmppd $21, {sae}, %zmm8, %zmm4, %k2 + vcmppd $17, {sae}, %zmm4, %zmm9, %k0 + vmovups poly_coeff_5+__svml_dasin_data_internal(%rip), %zmm10 + +/* polynomial */ + vmovups poly_coeff_1+__svml_dasin_data_internal(%rip), %zmm8 + vmovups poly_coeff_3+__svml_dasin_data_internal(%rip), %zmm9 + vminpd {sae}, %zmm11, %zmm7, %zmm3 + vxorpd %zmm12, %zmm12, %zmm12{%k1} + 
vaddpd {rn-sae}, %zmm11, %zmm11, %zmm0 + vxorpd %zmm6, %zmm4, %zmm5 + vmulpd {rn-sae}, %zmm12, %zmm12, %zmm13 + vmulpd {rn-sae}, %zmm12, %zmm0, %zmm7 + vmovups poly_coeff_7+__svml_dasin_data_internal(%rip), %zmm11 + vmovups poly_coeff_4+__svml_dasin_data_internal(%rip), %zmm12 + vfmsub213pd {rn-sae}, %zmm14, %zmm13, %zmm0 + vmovups sqrt_coeff_4+__svml_dasin_data_internal(%rip), %zmm13 + vfmadd231pd {rn-sae}, %zmm3, %zmm9, %zmm12 + vmovups poly_coeff_11+__svml_dasin_data_internal(%rip), %zmm9 + vfmadd231pd {rn-sae}, %zmm0, %zmm15, %zmm2 + vmovups poly_coeff_9+__svml_dasin_data_internal(%rip), %zmm15 + vmulpd {rn-sae}, %zmm0, %zmm7, %zmm14 + vfmadd213pd {rn-sae}, %zmm1, %zmm0, %zmm2 + vmovups poly_coeff_2+__svml_dasin_data_internal(%rip), %zmm1 + kmovw %k0, %edx + vfmadd213pd {rn-sae}, %zmm13, %zmm0, %zmm2 + vfmadd231pd {rn-sae}, %zmm3, %zmm8, %zmm1 + vmovups poly_coeff_10+__svml_dasin_data_internal(%rip), %zmm8 + vmulpd {rn-sae}, %zmm3, %zmm3, %zmm0 + vfmsub213pd {rn-sae}, %zmm7, %zmm14, %zmm2 + vmovups poly_coeff_6+__svml_dasin_data_internal(%rip), %zmm7 + vfmadd231pd {rn-sae}, %zmm3, %zmm15, %zmm8 + vfmadd213pd {rn-sae}, %zmm12, %zmm0, %zmm1 + vblendmpd %zmm2, %zmm4, %zmm2{%k2} + vfmadd231pd {rn-sae}, %zmm3, %zmm10, %zmm7 + vmovups poly_coeff_8+__svml_dasin_data_internal(%rip), %zmm10 + vmovups Pi2H+__svml_dasin_data_internal(%rip), %zmm4 + vfmadd231pd {rn-sae}, %zmm3, %zmm11, %zmm10 + vmovups poly_coeff_12+__svml_dasin_data_internal(%rip), %zmm11 + vfmadd213pd {rn-sae}, %zmm10, %zmm0, %zmm7 + vfmadd231pd {rn-sae}, %zmm3, %zmm9, %zmm11 + vmulpd {rn-sae}, %zmm0, %zmm0, %zmm10 + vfmadd213pd {rn-sae}, %zmm7, %zmm10, %zmm1 + vfmadd213pd {rn-sae}, %zmm8, %zmm0, %zmm1 + vfmadd213pd {rn-sae}, %zmm11, %zmm0, %zmm1 + vmulpd {rn-sae}, %zmm3, %zmm1, %zmm3 + vfmadd213pd {rn-sae}, %zmm2, %zmm2, %zmm3 + vaddpd {rn-sae}, %zmm4, %zmm3, %zmm3{%k2} + vxorpd %zmm5, %zmm3, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE 
rbx r12 r13 r14 r15 edx zmm0 zmm6 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm6, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 
0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call asin@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_asin_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dasin_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 AbsMask[8][2]; + __declspec(align(64)) VUINT32 OneHalf[8][2]; + __declspec(align(64)) VUINT32 SmallNorm[8][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 Two[8][2]; + __declspec(align(64)) VUINT32 sqrt_coeff[4][8][2]; + __declspec(align(64)) VUINT32 poly_coeff[12][8][2]; + __declspec(align(64)) VUINT32 Pi2H[8][2]; +} __svml_dasin_data_internal; +#endif +__svml_dasin_data_internal: + /*== AbsMask ==*/ + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== OneHalf ==*/ + .align 64 + .quad 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000 + /*== SmallNorm ==*/ + .align 64 + .quad 0x3000000000000000, 0x3000000000000000, 
0x3000000000000000, 0x3000000000000000, 0x3000000000000000, 0x3000000000000000, 0x3000000000000000, 0x3000000000000000 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== Two ==*/ + .align 64 + .quad 0x4000000000000000, 0x4000000000000000, 0x4000000000000000, 0x4000000000000000, 0x4000000000000000, 0x4000000000000000, 0x4000000000000000, 0x4000000000000000 + /*== sqrt_coeff[4] ==*/ + .align 64 + .quad 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3, 0xbf918000993B24C3 /* sqrt_coeff4 */ + .quad 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D, 0x3fa400006F70D42D /* sqrt_coeff3 */ + .quad 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97, 0xbfb7FFFFFFFFFE97 /* sqrt_coeff2 */ + .quad 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D, 0x3fcFFFFFFFFFFF9D /* sqrt_coeff1 */ + /*== poly_coeff[12] ==*/ + .align 64 + .quad 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909, 0x3fa07520C70EB909 /* poly_coeff12 */ + .quad 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED, 0xbf90FB17F7DBB0ED /* poly_coeff11 */ + .quad 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE, 0x3f943F44BFBC3BAE /* poly_coeff10 */ + .quad 0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 
0x3f7A583395D45ED5, 0x3f7A583395D45ED5, 0x3f7A583395D45ED5 /* poly_coeff9 */ + .quad 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6, 0x3f88F8DC2AFCCAD6 /* poly_coeff8 */ + .quad 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57, 0x3f8C6DBBCB88BD57 /* poly_coeff7 */ + .quad 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E, 0x3f91C6DCF538AD2E /* poly_coeff6 */ + .quad 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd, 0x3f96E89CEBDEFadd /* poly_coeff5 */ + .quad 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE, 0x3f9F1C72E13AD8BE /* poly_coeff4 */ + .quad 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8, 0x3fa6DB6DB3B445F8 /* poly_coeff3 */ + .quad 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE, 0x3fb333333337E0DE /* poly_coeff2 */ + .quad 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C, 0x3fc555555555529C /* poly_coeff1 */ + /*== Pi2H ==*/ + .align 64 + .quad 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18, 0x3ff921fb54442d18 + .align 64 + .type __svml_dasin_data_internal,@object + .size __svml_dasin_data_internal,.-__svml_dasin_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core-avx2.S 
new file mode 100644 index 0000000000..e0582f27d4 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized asinf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN16v_asinf _ZGVeN16v_asinf_avx2_wrapper +#include "../svml_s_asinf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core.c new file mode 100644 index 0000000000..4435055566 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized asinf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVeN16v_asinf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_asinf, __GI__ZGVeN16v_asinf, + __redirect__ZGVeN16v_asinf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core_avx512.S new file mode 100644 index 0000000000..7afdfd1317 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf16_core_avx512.S @@ -0,0 +1,260 @@ +/* Function asinf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * SelMask = (|x| >= 0.5) ? 1 : 0; + * R = SelMask ? sqrt(0.5 - 0.5*|x|) : |x| + * asin(x) = (SelMask ? 
(Pi/2 - 2*Poly(R)) : Poly(R))*(-1)^sign(x) + * + * + */ + +/* Offsets for data table __svml_sasin_data_internal + */ +#define AbsMask 0 +#define OneHalf 64 +#define SmallNorm 128 +#define One 192 +#define Two 256 +#define sqrt_coeff_1 320 +#define sqrt_coeff_2 384 +#define poly_coeff_1 448 +#define poly_coeff_2 512 +#define poly_coeff_3 576 +#define poly_coeff_4 640 +#define poly_coeff_5 704 +#define Pi2H 768 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_asinf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups __svml_sasin_data_internal(%rip), %zmm4 + vmovups OneHalf+__svml_sasin_data_internal(%rip), %zmm6 + +/* SQ ~ -2*sqrt(Y) */ + vmovups SmallNorm+__svml_sasin_data_internal(%rip), %zmm8 + vmovups Two+__svml_sasin_data_internal(%rip), %zmm12 + vmovups sqrt_coeff_1+__svml_sasin_data_internal(%rip), %zmm13 + vmovups One+__svml_sasin_data_internal(%rip), %zmm7 + vmovaps %zmm0, %zmm3 + +/* x = |arg| */ + vandps %zmm3, %zmm4, %zmm2 + vandnps %zmm3, %zmm4, %zmm1 + +/* x^2 */ + vmulps {rn-sae}, %zmm2, %zmm2, %zmm5 + vcmpps $17, {sae}, %zmm2, %zmm7, %k0 + vcmpps $21, {sae}, %zmm6, %zmm2, %k2 + vmovups poly_coeff_2+__svml_sasin_data_internal(%rip), %zmm7 + kmovw %k0, %edx + +/* Y = 0.5 - 0.5*x */ + vmovaps %zmm6, %zmm9 + vfnmadd231ps {rn-sae}, %zmm2, %zmm6, %zmm9 + vmovups poly_coeff_5+__svml_sasin_data_internal(%rip), %zmm6 + vrsqrt14ps %zmm9, %zmm10 + vcmpps $17, {sae}, %zmm8, %zmm9, %k1 + vminps {sae}, %zmm9, %zmm5, %zmm0 + vmovups sqrt_coeff_2+__svml_sasin_data_internal(%rip), %zmm8 + vmovups poly_coeff_4+__svml_sasin_data_internal(%rip), %zmm5 + vxorps %zmm10, %zmm10, %zmm10{%k1} + vaddps {rn-sae}, %zmm9, %zmm9, %zmm14 + vmulps {rn-sae}, %zmm10, %zmm10, %zmm11 + vmulps {rn-sae}, %zmm10, %zmm14, %zmm4 + vfmsub213ps {rn-sae}, %zmm12, %zmm11, %zmm14 + vmulps {rn-sae}, %zmm14, %zmm4, %zmm15 + vfmadd231ps {rn-sae}, %zmm14, %zmm13, %zmm8 + 
vmovups poly_coeff_3+__svml_sasin_data_internal(%rip), %zmm14 + +/* polynomial */ + vmovups poly_coeff_1+__svml_sasin_data_internal(%rip), %zmm13 + vfmsub213ps {rn-sae}, %zmm4, %zmm15, %zmm8 + vfmadd231ps {rn-sae}, %zmm0, %zmm14, %zmm5 + vfmadd231ps {rn-sae}, %zmm0, %zmm13, %zmm7 + vmulps {rn-sae}, %zmm0, %zmm0, %zmm15 + vblendmps %zmm8, %zmm2, %zmm2{%k2} + vfmadd213ps {rn-sae}, %zmm5, %zmm15, %zmm7 + vfmadd213ps {rn-sae}, %zmm6, %zmm0, %zmm7 + vmulps {rn-sae}, %zmm0, %zmm7, %zmm9 + vmovups Pi2H+__svml_sasin_data_internal(%rip), %zmm0 + vfmadd213ps {rn-sae}, %zmm2, %zmm2, %zmm9 + vaddps {rn-sae}, %zmm0, %zmm9, %zmm9{%k2} + vxorps %zmm1, %zmm9, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm3, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 
0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call asinf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_asinf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_sasin_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 AbsMask[16][1]; + __declspec(align(64)) VUINT32 OneHalf[16][1]; + __declspec(align(64)) VUINT32 SmallNorm[16][1]; + 
__declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 Two[16][1]; + __declspec(align(64)) VUINT32 sqrt_coeff[2][16][1]; + __declspec(align(64)) VUINT32 poly_coeff[5][16][1]; + __declspec(align(64)) VUINT32 Pi2H[16][1]; +} __svml_sasin_data_internal; +#endif +__svml_sasin_data_internal: + /*== AbsMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== OneHalf ==*/ + .align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== SmallNorm ==*/ + .align 64 + .long 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000 + /*== One ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== Two ==*/ + .align 64 + .long 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000 + /*== sqrt_coeff[2] ==*/ + .align 64 + .long 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004 /* sqrt_coeff2 */ + .long 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001 /* sqrt_coeff1 */ + /*== poly_coeff[5] ==*/ + .align 64 + .long 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 
0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07 /* poly_coeff5 */ + .long 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B /* poly_coeff4 */ + .long 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4 /* poly_coeff3 */ + .long 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12 /* poly_coeff2 */ + .long 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF /* poly_coeff1 */ + /*== Pi2H ==*/ + .align 64 + .long 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB + .align 64 + .type __svml_sasin_data_internal,@object + .size __svml_sasin_data_internal,.-__svml_sasin_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core-sse2.S new file mode 100644 index 0000000000..b958db7795 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized asinf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVbN4v_asinf _ZGVbN4v_asinf_sse2 +#include "../svml_s_asinf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core.c new file mode 100644 index 0000000000..5a7aa94264 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized asinf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVbN4v_asinf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_asinf, __GI__ZGVbN4v_asinf, + __redirect__ZGVbN4v_asinf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core_sse4.S new file mode 100644 index 0000000000..ddcceeb7b9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf4_core_sse4.S @@ -0,0 +1,252 @@ +/* Function asinf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * SelMask = (|x| >= 0.5) ? 1 : 0; + * R = SelMask ? sqrt(0.5 - 0.5*|x|) : |x| + * asin(x) = (SelMask ? 
(Pi/2 - 2*Poly(R)) : Poly(R))*(-1)^sign(x) + * + * + */ + +/* Offsets for data table __svml_sasin_data_internal + */ +#define AbsMask 0 +#define OneHalf 16 +#define SmallNorm 32 +#define One 48 +#define Two 64 +#define sqrt_coeff 80 +#define poly_coeff 112 +#define Pi2H 192 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_asinf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm2 + movups __svml_sasin_data_internal(%rip), %xmm1 + movups OneHalf+__svml_sasin_data_internal(%rip), %xmm5 + +/* x = |arg| */ + movaps %xmm1, %xmm0 + andps %xmm2, %xmm0 + +/* Y = 0.5 - 0.5*x */ + movaps %xmm5, %xmm3 + mulps %xmm0, %xmm3 + movaps %xmm5, %xmm8 + +/* x^2 */ + movaps %xmm0, %xmm14 + movaps %xmm0, %xmm15 + mulps %xmm0, %xmm14 + subps %xmm3, %xmm8 + cmpnltps %xmm5, %xmm15 + +/* SQ ~ -2*sqrt(Y) */ + rsqrtps %xmm8, %xmm6 + minps %xmm8, %xmm14 + movaps %xmm8, %xmm9 + movaps %xmm14, %xmm10 + cmpltps SmallNorm+__svml_sasin_data_internal(%rip), %xmm9 + mulps %xmm14, %xmm10 + addps %xmm8, %xmm8 + andnps %xmm6, %xmm9 + movaps %xmm15, %xmm3 + movaps %xmm9, %xmm7 + andnps %xmm0, %xmm3 + mulps %xmm9, %xmm7 + andnps %xmm2, %xmm1 + mulps %xmm8, %xmm9 + mulps %xmm7, %xmm8 + +/* polynomial */ + movups poly_coeff+__svml_sasin_data_internal(%rip), %xmm11 + mulps %xmm14, %xmm11 + subps Two+__svml_sasin_data_internal(%rip), %xmm8 + movups poly_coeff+32+__svml_sasin_data_internal(%rip), %xmm12 + mulps %xmm14, %xmm12 + addps poly_coeff+16+__svml_sasin_data_internal(%rip), %xmm11 + mulps %xmm10, %xmm11 + addps poly_coeff+48+__svml_sasin_data_internal(%rip), %xmm12 + movups sqrt_coeff+__svml_sasin_data_internal(%rip), %xmm13 + addps %xmm11, %xmm12 + mulps %xmm8, %xmm13 + mulps %xmm9, %xmm8 + mulps %xmm14, %xmm12 + addps sqrt_coeff+16+__svml_sasin_data_internal(%rip), %xmm13 + addps poly_coeff+64+__svml_sasin_data_internal(%rip), %xmm12 + mulps %xmm8, %xmm13 + mulps %xmm12, %xmm14 + subps %xmm9, %xmm13 + andps %xmm15, %xmm13 + orps %xmm13, %xmm3 + mulps %xmm3, 
%xmm14 + movups One+__svml_sasin_data_internal(%rip), %xmm4 + addps %xmm14, %xmm3 + cmpltps %xmm0, %xmm4 + movups Pi2H+__svml_sasin_data_internal(%rip), %xmm0 + andps %xmm15, %xmm0 + movmskps %xmm4, %edx + addps %xmm3, %xmm0 + pxor %xmm1, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm2, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call asinf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_asinf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_sasin_data_internal_typedef +typedef
unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 AbsMask[4][1]; + __declspec(align(16)) VUINT32 OneHalf[4][1]; + __declspec(align(16)) VUINT32 SmallNorm[4][1]; + __declspec(align(16)) VUINT32 One[4][1]; + __declspec(align(16)) VUINT32 Two[4][1]; + __declspec(align(16)) VUINT32 sqrt_coeff[2][4][1]; + __declspec(align(16)) VUINT32 poly_coeff[5][4][1]; + __declspec(align(16)) VUINT32 Pi2H[4][1]; +} __svml_sasin_data_internal; +#endif +__svml_sasin_data_internal: + /*== AbsMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== OneHalf ==*/ + .align 16 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== SmallNorm ==*/ + .align 16 + .long 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000 + /*== One ==*/ + .align 16 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== Two ==*/ + .align 16 + .long 0x40000000, 0x40000000, 0x40000000, 0x40000000 + /*== sqrt_coeff[2] ==*/ + .align 16 + .long 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004 /* sqrt_coeff2 */ + .long 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001 /* sqrt_coeff1 */ + /*== poly_coeff[5] ==*/ + .align 16 + .long 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07 /* poly_coeff5 */ + .long 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B /* poly_coeff4 */ + .long 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4 /* poly_coeff3 */ + .long 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12 /* poly_coeff2 */ + .long 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF /* poly_coeff1 */ + /*== Pi2H ==*/ + .align 16 + .long 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB + .align 16 + .type __svml_sasin_data_internal,@object + .size __svml_sasin_data_internal,.-__svml_sasin_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core-sse.S new file mode 100644 index 0000000000..6273c919d6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized asinf, 
vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN8v_asinf _ZGVdN8v_asinf_sse_wrapper +#include "../svml_s_asinf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core.c new file mode 100644 index 0000000000..946b25b43f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized asinf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVdN8v_asinf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_asinf, __GI__ZGVdN8v_asinf, + __redirect__ZGVdN8v_asinf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core_avx2.S new file mode 100644 index 0000000000..89c156dbbb --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinf8_core_avx2.S @@ -0,0 +1,249 @@ +/* Function asinf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * SelMask = (|x| >= 0.5) ? 1 : 0; + * R = SelMask ? sqrt(0.5 - 0.5*|x|) : |x| + * asin(x) = (SelMask ? 
(Pi/2 - 2*Poly(R)) : Poly(R))*(-1)^sign(x) + * + * + */ + +/* Offsets for data table __svml_sasin_data_internal + */ +#define AbsMask 0 +#define OneHalf 32 +#define SmallNorm 64 +#define One 96 +#define Two 128 +#define sqrt_coeff 160 +#define poly_coeff 224 +#define Pi2H 384 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_asinf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovups __svml_sasin_data_internal(%rip), %ymm5 + vmovups OneHalf+__svml_sasin_data_internal(%rip), %ymm9 + vmovups One+__svml_sasin_data_internal(%rip), %ymm6 + vmovaps %ymm0, %ymm4 + +/* x = |arg| */ + vandps %ymm4, %ymm5, %ymm3 + +/* Y = 0.5 - 0.5*x */ + vmovaps %ymm9, %ymm12 + vfnmadd231ps %ymm3, %ymm9, %ymm12 + +/* x^2 */ + vmulps %ymm3, %ymm3, %ymm7 + vcmplt_oqps %ymm3, %ymm6, %ymm8 + +/* SQ ~ -2*sqrt(Y) */ + vcmplt_oqps SmallNorm+__svml_sasin_data_internal(%rip), %ymm12, %ymm10 + vminps %ymm12, %ymm7, %ymm1 + vaddps %ymm12, %ymm12, %ymm15 + vcmpnlt_uqps %ymm9, %ymm3, %ymm0 + vrsqrtps %ymm12, %ymm11 + vmovups poly_coeff+64+__svml_sasin_data_internal(%rip), %ymm7 + vmulps %ymm1, %ymm1, %ymm6 + vmovups sqrt_coeff+__svml_sasin_data_internal(%rip), %ymm9 + vfmadd213ps poly_coeff+96+__svml_sasin_data_internal(%rip), %ymm1, %ymm7 + vmovmskps %ymm8, %edx + +/* polynomial */ + vmovups poly_coeff+__svml_sasin_data_internal(%rip), %ymm8 + vandnps %ymm11, %ymm10, %ymm13 + vmulps %ymm13, %ymm13, %ymm14 + vfmadd213ps poly_coeff+32+__svml_sasin_data_internal(%rip), %ymm1, %ymm8 + vandnps %ymm4, %ymm5, %ymm2 + vmulps %ymm15, %ymm13, %ymm5 + vfmsub213ps Two+__svml_sasin_data_internal(%rip), %ymm14, %ymm15 + vfmadd213ps %ymm7, %ymm6, %ymm8 + vfmadd213ps sqrt_coeff+32+__svml_sasin_data_internal(%rip), %ymm15, %ymm9 + vmulps %ymm15, %ymm5, %ymm15 + vfmadd213ps poly_coeff+128+__svml_sasin_data_internal(%rip), %ymm1, %ymm8 + vfmsub213ps %ymm5, %ymm15, %ymm9 + vmulps %ymm8, %ymm1, %ymm1 +
vblendvps %ymm0, %ymm9, %ymm3, %ymm3 + vfmadd213ps %ymm3, %ymm3, %ymm1 + vandps Pi2H+__svml_sasin_data_internal(%rip), %ymm0, %ymm0 + vaddps %ymm1, %ymm0, %ymm10 + vxorps %ymm2, %ymm10, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm4 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm4, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + 
movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call asinf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_asinf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_sasin_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 AbsMask[8][1]; + __declspec(align(32)) VUINT32 OneHalf[8][1]; + __declspec(align(32)) VUINT32 SmallNorm[8][1]; + __declspec(align(32)) VUINT32 One[8][1]; + __declspec(align(32)) VUINT32 Two[8][1]; + __declspec(align(32)) VUINT32 sqrt_coeff[2][8][1]; + __declspec(align(32)) VUINT32 poly_coeff[5][8][1]; + __declspec(align(32)) VUINT32 Pi2H[8][1]; +} __svml_sasin_data_internal; +#endif +__svml_sasin_data_internal: + /*== AbsMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== OneHalf ==*/ + .align 32 + .long 0x3f000000,
0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== SmallNorm ==*/ + .align 32 + .long 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000 + /*== One ==*/ + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== Two ==*/ + .align 32 + .long 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000, 0x40000000 + /*== sqrt_coeff[2] ==*/ + .align 32 + .long 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004, 0xbdC00004 /* sqrt_coeff2 */ + .long 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001, 0x3e800001 /* sqrt_coeff1 */ + /*== poly_coeff[5] ==*/ + .align 32 + .long 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07, 0x3d2EDC07 /* poly_coeff5 */ + .long 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B, 0x3CC32A6B /* poly_coeff4 */ + .long 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4, 0x3d3A9AB4 /* poly_coeff3 */ + .long 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12, 0x3d997C12 /* poly_coeff2 */ + .long 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF, 0x3e2AAAFF /* poly_coeff1 */ + /*== Pi2H ==*/ + .align 32 + .long 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB, 0x3fc90FDB + .align 32 + .type __svml_sasin_data_internal,@object + .size __svml_sasin_data_internal,.-__svml_sasin_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_asin2_core.S b/sysdeps/x86_64/fpu/svml_d_asin2_core.S new file mode 100644 index 0000000000..8ff8bc58df --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asin2_core.S @@ -0,0 +1,29 @@ +/* Function asin vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_asin) +WRAPPER_IMPL_SSE2 asin +END (_ZGVbN2v_asin) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_asin) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_asin4_core.S b/sysdeps/x86_64/fpu/svml_d_asin4_core.S new file mode 100644 index 0000000000..dbe33952bc --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asin4_core.S @@ -0,0 +1,29 @@ +/* Function asin vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_asin) +WRAPPER_IMPL_AVX _ZGVbN2v_asin +END (_ZGVdN4v_asin) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_asin) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_asin4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_asin4_core_avx.S new file mode 100644 index 0000000000..513a31bde5 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asin4_core_avx.S @@ -0,0 +1,25 @@ +/* Function asin vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_asin) +WRAPPER_IMPL_AVX _ZGVbN2v_asin +END (_ZGVcN4v_asin) diff --git a/sysdeps/x86_64/fpu/svml_d_asin8_core.S b/sysdeps/x86_64/fpu/svml_d_asin8_core.S new file mode 100644 index 0000000000..06694298cf --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asin8_core.S @@ -0,0 +1,25 @@ +/* Function asin vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_asin) +WRAPPER_IMPL_AVX512 _ZGVdN4v_asin +END (_ZGVeN8v_asin) diff --git a/sysdeps/x86_64/fpu/svml_s_asinf16_core.S b/sysdeps/x86_64/fpu/svml_s_asinf16_core.S new file mode 100644 index 0000000000..015d583e3f --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinf16_core.S @@ -0,0 +1,25 @@ +/* Function asinf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_asinf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_asinf +END (_ZGVeN16v_asinf) diff --git a/sysdeps/x86_64/fpu/svml_s_asinf4_core.S b/sysdeps/x86_64/fpu/svml_s_asinf4_core.S new file mode 100644 index 0000000000..d80f06c16d --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinf4_core.S @@ -0,0 +1,29 @@ +/* Function asinf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_asinf) +WRAPPER_IMPL_SSE2 asinf +END (_ZGVbN4v_asinf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_asinf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_asinf8_core.S b/sysdeps/x86_64/fpu/svml_s_asinf8_core.S new file mode 100644 index 0000000000..304ad0a7f5 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinf8_core.S @@ -0,0 +1,29 @@ +/* Function asinf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_asinf) +WRAPPER_IMPL_AVX _ZGVbN4v_asinf +END (_ZGVdN8v_asinf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_asinf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_asinf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_asinf8_core_avx.S new file mode 100644 index 0000000000..a2f7dc112e --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function asinf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_asinf) +WRAPPER_IMPL_AVX _ZGVbN4v_asinf +END (_ZGVcN8v_asinf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx.c new file mode 100644 index 0000000000..e37cfdce58 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-asin.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx2.c new file mode 100644 index 0000000000..e37cfdce58 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-asin.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx512f.c new file mode 100644 index 0000000000..e37cfdce58 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asin-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-asin.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asin.c b/sysdeps/x86_64/fpu/test-double-libmvec-asin.c new file mode 100644 index 0000000000..d2e16e67f4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asin.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC asin +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 467c913990..5746bb5be3 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -29,6 +29,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVbN2v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVbN2vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVbN2v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVbN2v_atan) +VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVbN2v_asin) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index b72a7de84e..8d3d5493ed 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVdN4v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVdN4vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVdN4v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVdN4v_atan) +VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVdN4v_asin) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index d2434df21e..f43328f2ff 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -29,6 +29,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVcN4v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVcN4vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVcN4v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVcN4v_atan) +VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVcN4v_asin) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index f7aaf8159e..8b566c199a 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -29,6 +29,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp), _ZGVeN8v_exp) VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVeN8vv_pow) VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVeN8v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVeN8v_atan) +VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVeN8v_asin) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx.c new file mode 100644 index 0000000000..6aa8f5f370 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-asinf.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx2.c new file mode 100644 index 0000000000..6aa8f5f370 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-asinf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx512f.c new file mode 100644 index 0000000000..6aa8f5f370 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-asinf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-asinf.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinf.c new file mode 100644 index 0000000000..2bbe2395a0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC asinf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index af769c56fa..3d3218a310 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -29,6 +29,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVeN16v_expf) VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVeN16vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVeN16v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVeN16v_atanf) +VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVeN16v_asinf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 76e61d2f1e..7d75b9f60f 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -29,6 +29,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVbN4v_expf) VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVbN4vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVbN4v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVbN4v_atanf) +VECTOR_WRAPPER 
(WRAPPER_NAME (asinf), _ZGVbN4v_asinf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 5e27eaaf29..405dde49bc 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVdN8v_expf) VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVdN8vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVdN8v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVdN8v_atanf) +VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVdN8v_asinf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 28daf79aa9..7558443f2e 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -29,6 +29,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expf), _ZGVcN8v_expf) VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVcN8vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVcN8v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVcN8v_atanf) +VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVcN8v_asinf) #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:15 2021
X-Patchwork-Submitter: Sunil Pandey
X-Patchwork-Id: 1573823
envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by bilbo.ozlabs.org (Postfix) with ESMTPS id 4JNmGT1Gvqz9sVq for ; Wed, 29 Dec 2021 07:24:17 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 091433858439 for ; Tue, 28 Dec 2021 20:24:15 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 091433858439 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1640723055; bh=ZP9aG4ZdkM5c3t1VxtQgcDzAKJcCzCkkCqwbo4C7GIY=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=O1EfbunCqCkOuW+OaDX/hI2Hr/tfcp/AvEM3de2Yj4qatYPHKYQQ7L84/EZ3/UCuh 7gNEPbmw5806X+fe/NdIK1+FgYjl4uRSpEOhy78yaQaZ27KV9DErNmR19iNU1Wb39A xIwCnPE2w9wVE3+3oBH9TRIEeQPighWSW6K9zYJI= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by sourceware.org (Postfix) with ESMTPS id 974703858004 for ; Tue, 28 Dec 2021 20:11:41 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 974703858004 X-IronPort-AV: E=McAfee;i="6200,9189,10211"; a="302169140" X-IronPort-AV: E=Sophos;i="5.88,243,1635231600"; d="scan'208";a="302169140" Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Dec 2021 12:11:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,243,1635231600"; d="scan'208";a="589098775" Received: from scymds01.sc.intel.com ([10.148.94.138]) by fmsmga004.fm.intel.com with ESMTP; 28 Dec 2021 12:11:31 -0800 Received: from gskx-1.sc.intel.com 
(gskx-1.sc.intel.com [172.25.149.211]) by scymds01.sc.intel.com with ESMTP id 1BSKBUsX016522; Tue, 28 Dec 2021 12:11:31 -0800 To: libc-alpha@sourceware.org Subject: [PATCH v4 03/18] x86-64: Add vector hypot/hypotf implementation to libmvec Date: Tue, 28 Dec 2021 12:11:15 -0800 Message-Id: <20211228201130.737370-4-skpgkp2@gmail.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED, FORGED_GMAIL_RCVD, FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM, GIT_PATCH_0, HK_RANDOM_ENVFROM, HK_RANDOM_FROM, KAM_DMARC_NONE, KAM_DMARC_STATUS, KAM_SHORT, KAM_STOCKGEN, NML_ADSP_CUSTOM_MED, SPF_HELO_NONE, SPF_SOFTFAIL, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Sunil K Pandey via Libc-alpha From: Sunil Pandey Reply-To: Sunil K Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Implement vectorized hypot/hypotf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector hypot/hypotf with regenerated ulps. 
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 ++ .../fpu/multiarch/svml_d_hypot2_core-sse2.S | 20 ++ .../x86_64/fpu/multiarch/svml_d_hypot2_core.c | 28 ++ .../fpu/multiarch/svml_d_hypot2_core_sse4.S | 279 +++++++++++++++++ .../fpu/multiarch/svml_d_hypot4_core-sse.S | 20 ++ .../x86_64/fpu/multiarch/svml_d_hypot4_core.c | 28 ++ .../fpu/multiarch/svml_d_hypot4_core_avx2.S | 289 ++++++++++++++++++ .../fpu/multiarch/svml_d_hypot8_core-avx2.S | 20 ++ .../x86_64/fpu/multiarch/svml_d_hypot8_core.c | 28 ++ .../fpu/multiarch/svml_d_hypot8_core_avx512.S | 235 ++++++++++++++ .../fpu/multiarch/svml_s_hypotf16_core-avx2.S | 20 ++ .../fpu/multiarch/svml_s_hypotf16_core.c | 28 ++ .../multiarch/svml_s_hypotf16_core_avx512.S | 239 +++++++++++++++ .../fpu/multiarch/svml_s_hypotf4_core-sse2.S | 20 ++ .../fpu/multiarch/svml_s_hypotf4_core.c | 28 ++ .../fpu/multiarch/svml_s_hypotf4_core_sse4.S | 265 ++++++++++++++++ .../fpu/multiarch/svml_s_hypotf8_core-sse.S | 20 ++ .../fpu/multiarch/svml_s_hypotf8_core.c | 28 ++ .../fpu/multiarch/svml_s_hypotf8_core_avx2.S | 269 ++++++++++++++++ sysdeps/x86_64/fpu/svml_d_hypot2_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_hypot4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_hypot4_core_avx.S | 25 ++ sysdeps/x86_64/fpu/svml_d_hypot8_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_hypotf16_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_hypotf4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_hypotf8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_hypotf8_core_avx.S | 25 ++ .../fpu/test-double-libmvec-hypot-avx.c | 1 + .../fpu/test-double-libmvec-hypot-avx2.c | 1 + .../fpu/test-double-libmvec-hypot-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-hypot.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-hypotf-avx.c | 1 + .../fpu/test-float-libmvec-hypotf-avx2.c | 1 + .../fpu/test-float-libmvec-hypotf-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-hypotf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2151 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_hypot2_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_hypot4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_hypot4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_hypot8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_hypotf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_hypotf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_hypotf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_hypotf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-hypot.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-hypotf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index ae8ee882d0..adf65f6bc2 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -131,4 +131,15 @@ #define __DECL_SIMD_asinf32x #define __DECL_SIMD_asinf64x #define __DECL_SIMD_asinf128x + +#define __DECL_SIMD_hypot +#define __DECL_SIMD_hypotf +#define __DECL_SIMD_hypotl +#define __DECL_SIMD_hypotf16 +#define __DECL_SIMD_hypotf32 +#define __DECL_SIMD_hypotf64 +#define __DECL_SIMD_hypotf128 +#define __DECL_SIMD_hypotf32x +#define __DECL_SIMD_hypotf64x +#define __DECL_SIMD_hypotf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index bb53b7021e..2ed820a0dc 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -144,7 +144,7 @@ __MATHCALL (sqrt,, (_Mdouble_ __x)); #if defined __USE_XOPEN || defined __USE_ISOC99 /* Return `sqrt(X*X + Y*Y)'. 
*/ -__MATHCALL (hypot,, (_Mdouble_ __x, _Mdouble_ __y)); +__MATHCALL_VEC (hypot,, (_Mdouble_ __x, _Mdouble_ __y)); #endif #if defined __USE_XOPEN_EXTENDED || defined __USE_ISOC99 diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index ab03a07f92..12bb03245b 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,24 +49,32 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F +GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F +GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F +GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F +GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F +GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F +GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F +GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F +GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 73cb8849ff..437977c5fd 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -70,6 +70,10 @@ # define __DECL_SIMD_asin __DECL_SIMD_x86_64 # undef __DECL_SIMD_asinf # define __DECL_SIMD_asinf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_hypot +# define __DECL_SIMD_hypot __DECL_SIMD_x86_64 +# undef __DECL_SIMD_hypotf +# define __DECL_SIMD_hypotf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h 
b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 4552c2bdfa..cda31479a6 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -34,6 +34,8 @@ !GCC$ builtin (atanf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (asin) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (asinf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (hypot) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (hypotf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -53,3 +55,5 @@ !GCC$ builtin (atanf) attributes simd (notinbranch) if('x32') !GCC$ builtin (asin) attributes simd (notinbranch) if('x32') !GCC$ builtin (asinf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (hypot) attributes simd (notinbranch) if('x32') +!GCC$ builtin (hypotf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index e0eae0b196..7769a02731 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -27,6 +27,7 @@ libmvec-funcs = \ atan \ cos \ exp \ + hypot \ log \ pow \ sin \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 10baf869a5..e359e5dc2c 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,8 +17,10 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; + _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; + _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } } diff --git 
a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index ea0f833381..a7513ec94e 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1375,6 +1375,26 @@ double: 1 float128: 1 ldouble: 1 +Function: "hypot_vlen16": +float: 1 + +Function: "hypot_vlen2": +double: 1 + +Function: "hypot_vlen4": +double: 1 +float: 1 + +Function: "hypot_vlen4_avx2": +double: 1 + +Function: "hypot_vlen8": +double: 1 +float: 1 + +Function: "hypot_vlen8_avx2": +float: 1 + Function: "j0": double: 3 float: 9 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core-sse2.S new file mode 100644 index 0000000000..237e38459e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized hypot. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2vv_hypot _ZGVbN2vv_hypot_sse2 +#include "../svml_d_hypot2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core.c new file mode 100644 index 0000000000..3f0865f05d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized hypot, vector length is 2. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN2vv_hypot +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2vv_hypot, __GI__ZGVbN2vv_hypot, + __redirect__ZGVbN2vv_hypot) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core_sse4.S new file mode 100644 index 0000000000..931f34e5f2 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot2_core_sse4.S @@ -0,0 +1,279 @@ +/* Function hypot vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * HIGH LEVEL OVERVIEW + * + * Calculate z = (x*x+y*y) + * Calculate reciprocal sqrt (z) + * Calculate error = z*(rsqrt(z)*rsqrt(z)) - 1 + * Calculate fixing part p with polynomial + * Fix answer with sqrt(z) = z * rsqrt(z) + error * p * z + * + * ALGORITHM DETAILS + * + * Multiprecision branch for _HA_ only + * Remove sign from both arguments + * Find maximum (_x) and minimum (_y) (by abs value) between arguments + * Split _x into _a and _b for multiprecision + * If _x >> _y we will not split _y for multiprecision + * all _y will be put into lower part (_d) and higher part (_c = 0) + * Fixing _hilo_mask for the case _x >> _y + * Split _y into _c and _d for multiprecision with fixed mask + * + * compute Hi and Lo parts of _z = _x*_x + _y*_y + * + * _zHi = _a*_a + _c*_c + * _zLo = (_x + _a)*_b + _d*_y + _d*_c + * _z = _zHi + _zLo + * + * No multiprecision branch for _LA_ and _EP_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + * + * Check _z exponent to be within borders [3BC ; 441] else goto Callout + * + * _s ~ 1.0/sqrt(_z) + * _s2 ~ 1.0/(sqrt(_z)*sqrt(_z)) ~ 1.0/_z = (1.0/_z + O) + * _e[rror] = (1.0/_z + O) * _z - 1.0 + * calculate fixing part _p + * _p = (((_POLY_C5*_e + _POLY_C4)*_e +_POLY_C3)*_e +_POLY_C2)*_e + _POLY_C1 + * some parts of polynomial are skipped for lower flavor + * + * result = _z * (1.0/sqrt(_z) + O) + _p * _e[rror] * _z + * + * + */ + +/* Offsets for data table __svml_dhypot_data_internal + */ +#define _dHiLoMask 0 +#define _dAbsMask 16 +#define _dOne 32 +#define _POLY_C5 48 +#define _POLY_C4 64 +#define _POLY_C3 80 +#define _POLY_C2 96 +#define _POLY_C1 112 +#define _LowBoundary 128 +#define _HighBoundary 144 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2vv_hypot_sse4) + subq $88, %rsp + cfi_def_cfa_offset(96) + +/* + *
Defines + * Implementation + * Multiprecision branch for _HA_ only + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + */ + movaps %xmm0, %xmm10 + movaps %xmm1, %xmm2 + mulpd %xmm0, %xmm10 + mulpd %xmm1, %xmm2 + addpd %xmm2, %xmm10 + +/* + * _s ~ 1.0/sqrt(_z) + * _s2 ~ 1.0/(sqrt(_z)*sqrt(_z)) ~ 1.0/_z + */ + cvtpd2ps %xmm10, %xmm7 + movlhps %xmm7, %xmm7 + rsqrtps %xmm7, %xmm8 + cvtps2pd %xmm8, %xmm11 + movaps %xmm11, %xmm2 + mulpd %xmm11, %xmm2 + +/* _e[rror] ~ (1.0/_z + O) * _z - 1.0 */ + mulpd %xmm10, %xmm2 + subpd _dOne+__svml_dhypot_data_internal(%rip), %xmm2 + +/* + * calculate fixing part _p + * _p = (((_POLY_C5*_e + _POLY_C4)*_e +_POLY_C3)*_e +_POLY_C2)*_e + _POLY_C1 + * some parts of polynomial are skipped for lower flavor + */ + movups _POLY_C4+__svml_dhypot_data_internal(%rip), %xmm9 + mulpd %xmm2, %xmm9 + addpd _POLY_C3+__svml_dhypot_data_internal(%rip), %xmm9 + mulpd %xmm2, %xmm9 + addpd _POLY_C2+__svml_dhypot_data_internal(%rip), %xmm9 + mulpd %xmm2, %xmm9 + addpd _POLY_C1+__svml_dhypot_data_internal(%rip), %xmm9 + +/* result = _z * (1.0/sqrt(_z) + O) + _p * _e[rror] * _z */ + mulpd %xmm9, %xmm2 + mulpd %xmm11, %xmm2 + mulpd %xmm10, %xmm11 + mulpd %xmm10, %xmm2 + +/* Check _z exponent to be within borders [3BC ; 441] else goto Callout */ + movq _LowBoundary+__svml_dhypot_data_internal(%rip), %xmm5 + movq _HighBoundary+__svml_dhypot_data_internal(%rip), %xmm3 + pshufd $221, %xmm10, %xmm4 + pcmpgtd %xmm4, %xmm5 + pcmpgtd %xmm3, %xmm4 + por %xmm4, %xmm5 + pshufd $80, %xmm5, %xmm6 + movmskpd %xmm6, %edx + addpd %xmm11, %xmm2 + +/* The end of implementation */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 xmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm2, %xmm0 + addq $88, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(96) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) +
movups %xmm1, 48(%rsp) + movups %xmm2, 64(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -80) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -88) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -96) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 64(%rsp), %xmm2 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -80) + cfi_offset(13, -88) + cfi_offset(14, -96) + # LOE rbx rbp r12 r13 r14 r15 xmm2 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + movsd 48(%rsp,%r14,8), %xmm1 + call hypot@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2vv_hypot_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dhypot_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _dHiLoMask[2][2]; + __declspec(align(16)) VUINT32 _dAbsMask[2][2]; + __declspec(align(16)) VUINT32 _dOne[2][2]; + __declspec(align(16)) VUINT32 _POLY_C5[2][2]; + __declspec(align(16)) VUINT32 _POLY_C4[2][2]; + __declspec(align(16)) VUINT32 _POLY_C3[2][2]; + __declspec(align(16)) VUINT32 _POLY_C2[2][2]; + __declspec(align(16)) VUINT32 _POLY_C1[2][2]; + __declspec(align(16)) VUINT32 _LowBoundary[4][1]; + __declspec(align(16)) VUINT32 _HighBoundary[4][1]; +} __svml_dhypot_data_internal; +#endif
+__svml_dhypot_data_internal: + /* legacy algorithm */ + .quad 0xffffc00000000000, 0xffffc00000000000 /* _dHiLoMask */ + .align 16 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff /* _dAbsMask */ + .align 16 + .quad 0x3FF0000000000000, 0x3FF0000000000000 /* _dOne */ + .align 16 + .quad 0xBFCF800000000000, 0xBFCF800000000000 /* _POLY_C5 */ + .align 16 + .quad 0x3FD1800000000000, 0x3FD1800000000000 /* _POLY_C4 */ + .align 16 + .quad 0xBFD4000000000000, 0xBFD4000000000000 /* _POLY_C3 */ + .align 16 + .quad 0x3FD8000000000000, 0x3FD8000000000000 /* _POLY_C2 */ + .align 16 + .quad 0xBFE0000000000000, 0xBFE0000000000000 /* _POLY_C1 */ + .align 16 + .long 0x3BC00000, 0x3BC00000, 0x3BC00000, 0x3BC00000 /* _LowBoundary */ + .align 16 + .long 0x44100000, 0x44100000, 0x44100000, 0x44100000 /* _HighBoundary */ + .align 16 + .type __svml_dhypot_data_internal,@object + .size __svml_dhypot_data_internal,.-__svml_dhypot_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core-sse.S new file mode 100644 index 0000000000..5e7c75c44c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized hypot. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define _ZGVdN4vv_hypot _ZGVdN4vv_hypot_sse_wrapper +#include "../svml_d_hypot4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core.c new file mode 100644 index 0000000000..06f34d35e1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized hypot, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN4vv_hypot +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4vv_hypot, __GI__ZGVdN4vv_hypot, + __redirect__ZGVdN4vv_hypot) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core_avx2.S new file mode 100644 index 0000000000..45028ab7e9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot4_core_avx2.S @@ -0,0 +1,289 @@ +/* Function hypot vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * HIGH LEVEL OVERVIEW + * + * Calculate z = (x*x+y*y) + * Calculate reciprocal sqrt (z) + * Calculate error = z*(rsqrt(z)*rsqrt(z)) - 1 + * Calculate fixing part p with polynomial + * Fix answer with sqrt(z) = z * rsqrt(z) + error * p * z + * + * ALGORITHM DETAILS + * + * Multiprecision branch for _HA_ only + * Remove sign from both arguments + * Find maximum (_x) and minimum (_y) (by abs value) between arguments + * Split _x into _a and _b for multiprecision + * If _x >> _y we will not split _y for multiprecision + * all _y will be put into lower part (_d) and higher part (_c = 0) + * Fixing _hilo_mask for the case _x >> _y + * Split _y into _c and _d for multiprecision with fixed mask + * + * compute Hi and Lo parts of _z = _x*_x + _y*_y + * + * _zHi = _a*_a + _c*_c + * _zLo = (_x + _a)*_b + _d*_y + _d*_c + * _z = _zHi + _zLo + * + * No multiprecision branch for _LA_ and _EP_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + * + * Check _z exponent to be within borders [3BC ; 441] else goto Callout + * + * _s ~ 1.0/sqrt(_z) + * _s2 ~ 1.0/(sqrt(_z)*sqrt(_z)) ~ 1.0/_z = (1.0/_z + O) + * _e[rror] = (1.0/_z + O) * _z - 1.0 + * calculate fixing part _p + * _p = (((_POLY_C5*_e + _POLY_C4)*_e +_POLY_C3)*_e +_POLY_C2)*_e + _POLY_C1 + * some parts of polynomial are skipped
for lower flavor + * + * result = _z * (1.0/sqrt(_z) + O) + _p * _e[rror] * _z + * + * + */ + +/* Offsets for data table __svml_dhypot_data_internal + */ +#define _dHiLoMask 0 +#define _dAbsMask 32 +#define _dOne 64 +#define _POLY_C5 96 +#define _POLY_C4 128 +#define _POLY_C3 160 +#define _POLY_C2 192 +#define _POLY_C1 224 +#define _LowBoundary 256 +#define _HighBoundary 288 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4vv_hypot_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $128, %rsp + vmovapd %ymm1, %ymm2 + vmovapd %ymm0, %ymm1 + +/* + * Defines + * Implementation + * Multiprecision branch for _HA_ only + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + */ + vmulpd %ymm1, %ymm1, %ymm0 + +/* + * calculate fixing part _p + * _p = (((_POLY_C5*_e + _POLY_C4)*_e +_POLY_C3)*_e +_POLY_C2)*_e + _POLY_C1 + * some parts of polynomial are skipped for lower flavor + */ + vmovupd _POLY_C4+__svml_dhypot_data_internal(%rip), %ymm15 + vmovups _LowBoundary+__svml_dhypot_data_internal(%rip), %xmm4 + vfmadd231pd %ymm2, %ymm2, %ymm0 + +/* + * _s ~ 1.0/sqrt(_z) + * _s2 ~ 1.0/(sqrt(_z)*sqrt(_z)) ~ 1.0/_z + */ + vcvtpd2ps %ymm0, %xmm12 + +/* Check _z exponent to be within borders [3BC ; 441] else goto Callout */ + vextractf128 $1, %ymm0, %xmm3 + vrsqrtps %xmm12, %xmm13 + vshufps $221, %xmm3, %xmm0, %xmm5 + vcvtps2pd %xmm13, %ymm3 + vpcmpgtd %xmm5, %xmm4, %xmm6 + vpcmpgtd _HighBoundary+__svml_dhypot_data_internal(%rip), %xmm5, %xmm7 + vpor %xmm7, %xmm6, %xmm9 + vpshufd $80, %xmm9, %xmm8 + vmulpd %ymm3, %ymm3, %ymm14 + vpshufd $250, %xmm9, %xmm10 + +/* _e[rror] ~ (1.0/_z + O) * _z - 1.0 */ + vfmsub213pd _dOne+__svml_dhypot_data_internal(%rip), %ymm0, %ymm14 + vfmadd213pd _POLY_C3+__svml_dhypot_data_internal(%rip), %ymm14, %ymm15 + vfmadd213pd _POLY_C2+__svml_dhypot_data_internal(%rip), %ymm14, %ymm15 + vfmadd213pd _POLY_C1+__svml_dhypot_data_internal(%rip), %ymm14, %ymm15 + +/* result = _z *
(1.0/sqrt(_z) + O) + _p * _e[rror] * _z */ + vmulpd %ymm15, %ymm14, %ymm14 + vmulpd %ymm14, %ymm3, %ymm15 + vmulpd %ymm15, %ymm0, %ymm4 + vfmadd213pd %ymm4, %ymm3, %ymm0 + vinsertf128 $1, %xmm10, %ymm8, %ymm11 + vmovmskpd %ymm11, %edx + +/* The end of implementation */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 ymm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm1, 32(%rsp) + vmovupd %ymm2, 64(%rsp) + vmovupd %ymm0, 96(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits 
in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 96(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math fucntion call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + movsd 64(%rsp,%r14,8), %xmm1 + call hypot@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 96(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4vv_hypot_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dhypot_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _dHiLoMask[4][2]; + __declspec(align(32)) VUINT32 _dAbsMask[4][2]; + __declspec(align(32)) VUINT32 _dOne[4][2]; + __declspec(align(32)) VUINT32 _POLY_C5[4][2]; + __declspec(align(32)) VUINT32 _POLY_C4[4][2]; + __declspec(align(32)) VUINT32 _POLY_C3[4][2]; + __declspec(align(32)) VUINT32 _POLY_C2[4][2]; + __declspec(align(32)) VUINT32 _POLY_C1[4][2]; + __declspec(align(32)) VUINT32 _LowBoundary[8][1]; + __declspec(align(32)) VUINT32 
_HighBoundary[8][1]; +} __svml_dhypot_data_internal; +#endif +__svml_dhypot_data_internal: + /* legacy algorithm */ + .quad 0xffffc00000000000, 0xffffc00000000000, 0xffffc00000000000, 0xffffc00000000000 /* _dHiLoMask */ + .align 32 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* _dAbsMask */ + .align 32 + .quad 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000 /* _dOne */ + .align 32 + .quad 0xBFCF800000000000, 0xBFCF800000000000, 0xBFCF800000000000, 0xBFCF800000000000 /* _POLY_C5 */ + .align 32 + .quad 0x3FD1800000000000, 0x3FD1800000000000, 0x3FD1800000000000, 0x3FD1800000000000 /* _POLY_C4 */ + .align 32 + .quad 0xBFD4000000000000, 0xBFD4000000000000, 0xBFD4000000000000, 0xBFD4000000000000 /* _POLY_C3 */ + .align 32 + .quad 0x3FD8000000000000, 0x3FD8000000000000, 0x3FD8000000000000, 0x3FD8000000000000 /* _POLY_C2 */ + .align 32 + .quad 0xBFE0000000000000, 0xBFE0000000000000, 0xBFE0000000000000, 0xBFE0000000000000 /* _POLY_C1 */ + .align 32 + .long 0x3BC00000, 0x3BC00000, 0x3BC00000, 0x3BC00000, 0x3BC00000, 0x3BC00000, 0x3BC00000, 0x3BC00000 /* _LowBoundary */ + .align 32 + .long 0x44100000, 0x44100000, 0x44100000, 0x44100000, 0x44100000, 0x44100000, 0x44100000, 0x44100000 /* _HighBoundary */ + .align 32 + .type __svml_dhypot_data_internal,@object + .size __svml_dhypot_data_internal,.-__svml_dhypot_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core-avx2.S new file mode 100644 index 0000000000..a53e82cf9a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized hypot. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN8vv_hypot _ZGVeN8vv_hypot_avx2_wrapper +#include "../svml_d_hypot8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core.c new file mode 100644 index 0000000000..6052c752c9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized hypot, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVeN8vv_hypot +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8vv_hypot, __GI__ZGVeN8vv_hypot, + __redirect__ZGVeN8vv_hypot) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core_avx512.S new file mode 100644 index 0000000000..1e5e716a8d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_hypot8_core_avx512.S @@ -0,0 +1,235 @@ +/* Function hypot vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * HIGH LEVEL OVERVIEW + * + * Calculate z = (x*x+y*y) + * Calculate the reciprocal sqrt (z) + * Calculate error = z*(rsqrt(z)*rsqrt(z)) - 1 + * Calculate fixing part p with a polynomial + * Fix answer with sqrt(z) = z * rsqrt(z) + error * p * z + * + * ALGORITHM DETAILS + * + * Multiprecision branch for _HA_ only + * Remove the sign from both arguments + * Find maximum (_x) and minimum (_y) (by abs value) between arguments + * Split _x into _a and _b for multiprecision + * If _x >> _y we will not split _y for multiprecision + * all _y will be put into lower part (_d) and higher part (_c = 0) + * Fixing _hilo_mask for the case _x >> _y + * Split _y into _c and _d for multiprecision with fixed mask + * + * compute Hi and Lo parts of _z = _x*_x + _y*_y + * + * _zHi = _a*_a + _c*_c + * _zLo = (_x + _a)*_b + _d*_y + _d*_c + * _z = _zHi + _zLo + * + * No multiprecision branch for _LA_ and _EP_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + * + * Check _z exponent to be within borders [3BC ; 441] else goto Callout + * + * _s ~ 1.0/sqrt(_z) + * _s2 ~ 1.0/(sqrt(_z)*sqrt(_z)) ~ 1.0/_z = (1.0/_z + O) + * _e[rror] = (1.0/_z + O) * _z - 1.0 + * calculate fixing part _p + * _p = (((_POLY_C5*_e + _POLY_C4)*_e +_POLY_C3)*_e +_POLY_C2)*_e + _POLY_C1 + * some parts of the polynomial are skipped for the lower flavor + * + * result = _z * (1.0/sqrt(_z) + O) + _p * _e[rror] * _z + * + * + */ + +/* Offsets for data table __svml_dhypot_data_internal + */ +#define _dAbsMask 0 +#define _lExpBound_uisa 64 +#define _lExpBound 128 +#define _dHalf 192 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8vv_hypot_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $256, %rsp + vgetexppd {sae}, %zmm0, %zmm2 + vgetexppd {sae}, %zmm1, %zmm3 + vmovups _dHalf+__svml_dhypot_data_internal(%rip), %zmm9 + vmaxpd {sae}, %zmm3, %zmm2, %zmm4 + vmulpd {rn-sae}, %zmm0, %zmm0, %zmm2 + 
vandpd _dAbsMask+__svml_dhypot_data_internal(%rip), %zmm4, %zmm5 + vfmadd231pd {rn-sae}, %zmm1, %zmm1, %zmm2 + +/* Select exponent bound so that no scaling is needed */ + vpcmpq $5, _lExpBound_uisa+__svml_dhypot_data_internal(%rip), %zmm5, %k0 + vrsqrt14pd %zmm2, %zmm6 + kmovw %k0, %edx + vmulpd {rn-sae}, %zmm6, %zmm2, %zmm7 + vmulpd {rn-sae}, %zmm6, %zmm9, %zmm8 + vfnmadd231pd {rn-sae}, %zmm7, %zmm8, %zmm9 + vfmadd231pd {rn-sae}, %zmm9, %zmm8, %zmm8 + vfmadd213pd {rn-sae}, %zmm7, %zmm7, %zmm9 + vfnmadd231pd {rn-sae}, %zmm9, %zmm9, %zmm2 + vfmadd213pd {rn-sae}, %zmm9, %zmm8, %zmm2 + +/* The end of implementation */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 zmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm2, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm1, 128(%rsp) + vmovups %zmm2, 192(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm2 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 
0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 192(%rsp), %zmm2 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm2 + +/* Scalar math fucntion call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + movsd 128(%rsp,%r14,8), %xmm1 + call hypot@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 192(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8vv_hypot_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dhypot_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _dAbsMask[8][2]; + __declspec(align(64)) VUINT32 _lExpBound_uisa[8][2]; 
+ __declspec(align(64)) VUINT32 _lExpBound[8][2]; + __declspec(align(64)) VUINT32 _dHalf[8][2]; +} __svml_dhypot_data_internal; +#endif +__svml_dhypot_data_internal: + /* legacy algorithm */ + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* _dAbsMask */ + /* fma based algorithm*/ + .align 64 + .quad 0x407ff00000000000, 0x407ff00000000000, 0x407ff00000000000, 0x407ff00000000000, 0x407ff00000000000, 0x407ff00000000000, 0x407ff00000000000, 0x407ff00000000000 /* _lExpBound_uisa */ + .align 64 + .quad 0x404f800000000000, 0x404f800000000000, 0x404f800000000000, 0x404f800000000000, 0x404f800000000000, 0x404f800000000000, 0x404f800000000000, 0x404f800000000000 /* _lExpBound */ + .align 64 + .quad 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000 /* _dHalf */ + .align 64 + .type __svml_dhypot_data_internal,@object + .size __svml_dhypot_data_internal,.-__svml_dhypot_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core-avx2.S new file mode 100644 index 0000000000..a6ba40df4d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized hypotf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN16vv_hypotf _ZGVeN16vv_hypotf_avx2_wrapper +#include "../svml_s_hypotf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core.c new file mode 100644 index 0000000000..0c9eb6a364 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized hypotf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16vv_hypotf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16vv_hypotf, __GI__ZGVeN16vv_hypotf, + __redirect__ZGVeN16vv_hypotf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core_avx512.S new file mode 100644 index 0000000000..46a156d136 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf16_core_avx512.S @@ -0,0 +1,239 @@ +/* Function hypotf vectorized with AVX-512. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * HIGH LEVEL OVERVIEW + * + * Calculate z = (x*x+y*y) + * Calculate the reciprocal sqrt (z) + * Make two NR iterations + * + * ALGORITHM DETAILS + * + * Multiprecision branch for _HA_ only + * Remove the sign from both arguments + * Find maximum (_x) and minimum (_y) (by abs value) between arguments + * Split _x into _a and _b for multiprecision + * If _x >> _y we will not split _y for multiprecision + * all _y will be put into lower part (_d) and higher part (_c = 0) + * Fixing _hilo_mask for the case _x >> _y + * Split _y into _c and _d for multiprecision with fixed mask + * + * compute Hi and Lo parts of _z = _x*_x + _y*_y + * + * _zHi = _a*_a + _c*_c + * _zLo = (_x + _a)*_b + _d*_y + _d*_c + * _z = _zHi + _zLo + * + * No multiprecision branch for _LA_ and _EP_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + * + * Check _z exponent to be within borders [1E3 ; 60A] else goto Callout + * + * Compute the reciprocal sqrt s0 ~ 1.0/sqrt(_z), + * which, multiplied by _z, is the final result for the _EP_ version. 
+ * + * First iteration (or zero iteration): + * s = z * s0 + * h = .5 * s0 + * d = s * h - .5 + * + * Second iteration: + * h = d * h + h + * s = s * d + s + * d = s * s - z (in multiprecision for _HA_) + * + * result = s - h * d + * + * The EP version of the function can be implemented as y[i]=sqrt(a[i]^2+b[i]^2) + * with all intermediate operations done in target precision for i=1,..,n. + * It can return result y[i]=0 in case a[i]^2 and b[i]^2 underflow in target + * precision (for some i). It can return result y[i]=NAN in case + * a[i]^2+b[i]^2 overflow in target precision, for some i. It can return + * result y[i]=NAN in case a[i] or b[i] is infinite, for some i. + * + * + */ + +/* Offsets for data table __svml_shypot_data_internal + */ +#define _sAbsMask 0 +#define _sHalf 64 +#define _iExpBound 128 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16vv_hypotf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $256, %rsp + vgetexpps {sae}, %zmm0, %zmm2 + vgetexpps {sae}, %zmm1, %zmm3 + vmovups _sHalf+__svml_shypot_data_internal(%rip), %zmm6 + vmaxps {sae}, %zmm3, %zmm2, %zmm4 + vmulps {rn-sae}, %zmm0, %zmm0, %zmm2 + vandps _sAbsMask+__svml_shypot_data_internal(%rip), %zmm4, %zmm5 + vfmadd231ps {rn-sae}, %zmm1, %zmm1, %zmm2 + vpcmpd $5, _iExpBound+__svml_shypot_data_internal(%rip), %zmm5, %k0 + vrsqrt14ps %zmm2, %zmm7 + kmovw %k0, %edx + vmulps {rn-sae}, %zmm7, %zmm2, %zmm9 + vmulps {rn-sae}, %zmm7, %zmm6, %zmm8 + vfnmadd231ps {rn-sae}, %zmm9, %zmm9, %zmm2 + vfmadd213ps {rn-sae}, %zmm9, %zmm8, %zmm2 + +/* + * VSCALEF( S, _VRES1, _VRES1, sExp ); + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 zmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm2, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + 
cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm1, 128(%rsp) + vmovups %zmm2, 192(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm2 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 192(%rsp), %zmm2 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) 
(DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm2 + +/* Scalar math fucntion call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + movss 128(%rsp,%r14,4), %xmm1 + call hypotf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 192(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16vv_hypotf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_shypot_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _sAbsMask[16][1]; + __declspec(align(64)) VUINT32 _sHalf[16][1]; + __declspec(align(64)) VUINT32 _iExpBound[16][1]; +} __svml_shypot_data_internal; +#endif +__svml_shypot_data_internal: + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _sAbsMask */ + .align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sHalf */ + /* fma based algorithm*/ + .align 64 + .long 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000, 0x427C0000 /* _iExpBound */ + .align 64 + .type __svml_shypot_data_internal,@object + .size 
__svml_shypot_data_internal,.-__svml_shypot_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core-sse2.S new file mode 100644 index 0000000000..5e9dd22d94 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized hypotf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4vv_hypotf _ZGVbN4vv_hypotf_sse2 +#include "../svml_s_hypotf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core.c new file mode 100644 index 0000000000..91c9f5ca3f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized hypotf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4vv_hypotf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4vv_hypotf, __GI__ZGVbN4vv_hypotf, + __redirect__ZGVbN4vv_hypotf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core_sse4.S new file mode 100644 index 0000000000..a3f6d21ce1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf4_core_sse4.S @@ -0,0 +1,265 @@ +/* Function hypotf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * HIGH LEVEL OVERVIEW + * + * Calculate z = (x*x+y*y) + * Calculate the reciprocal sqrt (z) + * Make two NR iterations + * + * ALGORITHM DETAILS + * + * Multiprecision branch for _HA_ only + * Remove the sign from both arguments + * Find maximum (_x) and minimum (_y) (by abs value) between arguments + * Split _x into _a and _b for multiprecision + * If _x >> _y we will not split _y for multiprecision + * all _y will be put into lower part (_d) and higher part (_c = 0) + * Fixing _hilo_mask for the case _x >> _y + * Split _y into _c and _d for multiprecision with fixed mask + * + * compute Hi and Lo parts of _z = _x*_x + _y*_y + * + * _zHi = _a*_a + _c*_c + * _zLo = (_x + _a)*_b + _d*_y + _d*_c + * _z = _zHi + _zLo + * + * No multiprecision branch for _LA_ and _EP_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + * + * Check _z exponent to be within borders [1E3 ; 60A] else goto Callout + * + * Compute the reciprocal sqrt s0 ~ 1.0/sqrt(_z), + * which, multiplied by _z, is the final result for the _EP_ version. + * + * First iteration (or zero iteration): + * s = z * s0 + * h = .5 * s0 + * d = s * h - .5 + * + * Second iteration: + * h = d * h + h + * s = s * d + s + * d = s * s - z (in multiprecision for _HA_) + * + * result = s - h * d + * + * The EP version of the function can be implemented as y[i]=sqrt(a[i]^2+b[i]^2) + * with all intermediate operations done in target precision for i=1,..,n. + * It can return result y[i]=0 in case a[i]^2 and b[i]^2 underflow in target + * precision (for some i). It can return result y[i]=NAN in case + * a[i]^2+b[i]^2 overflow in target precision, for some i. It can return + * result y[i]=NAN in case a[i] or b[i] is infinite, for some i. 
+ * + * + */ + +/* Offsets for data table __svml_shypot_data_internal + */ +#define _sHiLoMask 0 +#define _sAbsMask 16 +#define _sHalf 32 +#define _LowBoundary 48 +#define _HighBoundary 64 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4vv_hypotf_sse4) + subq $88, %rsp + cfi_def_cfa_offset(96) + +/* + * Implementation + * Multiprecision branch for _HA_ only + * No multiprecision branch for _LA_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + */ + movaps %xmm0, %xmm8 + movaps %xmm1, %xmm2 + mulps %xmm0, %xmm8 + mulps %xmm1, %xmm2 + +/* + * Variables + * Defines + * Constants loading + */ + movups _sHalf+__svml_shypot_data_internal(%rip), %xmm5 + addps %xmm2, %xmm8 + +/* _s0 ~ 1.0/sqrt(_z) */ + rsqrtps %xmm8, %xmm10 + +/* First iteration */ + movaps %xmm10, %xmm2 + movaps %xmm8, %xmm3 + mulps %xmm8, %xmm2 + mulps %xmm5, %xmm10 + movaps %xmm2, %xmm6 + mulps %xmm10, %xmm6 + +/* Check _z exponent to be within borders [1E3 ; 60A] else goto Callout */ + movdqu _LowBoundary+__svml_shypot_data_internal(%rip), %xmm4 + subps %xmm6, %xmm5 + +/* Second iteration */ + movaps %xmm5, %xmm7 + pcmpgtd %xmm8, %xmm4 + mulps %xmm2, %xmm5 + mulps %xmm10, %xmm7 + addps %xmm5, %xmm2 + addps %xmm7, %xmm10 + +/* Finish second iteration in native precision for _LA_ */ + movaps %xmm2, %xmm9 + mulps %xmm2, %xmm9 + pcmpgtd _HighBoundary+__svml_shypot_data_internal(%rip), %xmm3 + subps %xmm8, %xmm9 + mulps %xmm9, %xmm10 + por %xmm3, %xmm4 + movmskps %xmm4, %edx + subps %xmm10, %xmm2 + +/* The end of implementation */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 xmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm2, %xmm0 + addq $88, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(96) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + movups %xmm2, 64(%rsp) + 
# LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -80) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -88) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -96) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 64(%rsp), %xmm2 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -80) + cfi_offset(13, -88) + cfi_offset(14, -96) + # LOE rbx rbp r12 r13 r14 r15 xmm2 + +/* Scalar math fucntion call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + movss 48(%rsp,%r14,4), %xmm1 + call hypotf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4vv_hypotf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_shypot_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _sHiLoMask[4][1]; + __declspec(align(16)) VUINT32 _sAbsMask[4][1]; + __declspec(align(16)) VUINT32 _sHalf[4][1]; + __declspec(align(16)) VUINT32 _LowBoundary[4][1]; + __declspec(align(16)) VUINT32 _HighBoundary[4][1]; +} __svml_shypot_data_internal; +#endif +__svml_shypot_data_internal: + /* legacy algorithm */ + .long 0xFFF80000, 0xFFF80000, 0xFFF80000, 0xFFF80000 /* _sHiLoMask */ + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _sAbsMask */ + .align 16 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sHalf */ + 
.align 16 + .long 0x1E300000, 0x1E300000, 0x1E300000, 0x1E300000 /* _LowBoundary */ + .align 16 + .long 0x60A00000, 0x60A00000, 0x60A00000, 0x60A00000 /* _HighBoundary */ + .align 16 + .type __svml_shypot_data_internal,@object + .size __svml_shypot_data_internal,.-__svml_shypot_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core-sse.S new file mode 100644 index 0000000000..d37556e331 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized hypotf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8vv_hypotf _ZGVdN8vv_hypotf_sse_wrapper +#include "../svml_s_hypotf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core.c new file mode 100644 index 0000000000..6cc497e73d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized sinf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN8vv_hypotf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8vv_hypotf, __GI__ZGVdN8vv_hypotf, + __redirect__ZGVdN8vv_hypotf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core_avx2.S new file mode 100644 index 0000000000..733022ff01 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_hypotf8_core_avx2.S @@ -0,0 +1,269 @@ +/* Function hypotf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * HIGH LEVEL OVERVIEW + * + * Calculate z = (x*x+y*y) + * Calculate reciplicle sqrt (z) + * Calculate make two NR iterations + * + * ALGORITHM DETAILS + * + * Multiprecision branch for _HA_ only + * Remove sigm from both arguments + * Find maximum (_x) and minimum (_y) (by abs value) between arguments + * Split _x int _a and _b for multiprecision + * If _x >> _y we will we will not split _y for multiprecision + * all _y will be put into lower part (_d) and higher part (_c = 0) + * Fixing _hilo_mask for the case _x >> _y + * Split _y into _c and _d for multiprecision with fixed mask + * + * compute Hi and Lo parts of _z = _x*_x + _y*_y + * + * _zHi = _a*_a + _c*_c + * _zLo = (_x + _a)*_b + _d*_y + _d*_c + * _z = _zHi + _zLo + * + * No multiprecision branch for _LA_ and _EP_ + * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2 + * + * Check _z exponent to be withing borders [1E3 ; 60A] else goto Callout + * + * Compute resciplicle sqrt s0 ~ 1.0/sqrt(_z), + * that multiplied by _z, is final result for _EP_ version. + * + * First iteration (or zero iteration): + * s = z * s0 + * h = .5 * s0 + * d = s * h - .5 + * + * Second iteration: + * h = d * h + h + * s = s * d + s + * d = s * s - z (in multiprecision for _HA_) + * + * result = s - h * d + * + * EP version of the function can be implemented as y[i]=sqrt(a[i]^2+b[i]^2) + * with all intermediate operations done in target precision for i=1,..,n. + * It can return result y[i]=0 in case a[i]^2 and b[i]^2 underflow in target + * precision (for some i). It can return result y[i]=NAN in case + * a[i]^2+b[i]^2 overflow in target precision, for some i. It can return + * result y[i]=NAN in case a[i] or b[i] is infinite, for some i. 
+ *
+ *
+ */
+
+/* Offsets for data table __svml_shypot_data_internal
+ */
+#define _sHiLoMask 0
+#define _sAbsMask 32
+#define _sHalf 64
+#define _LowBoundary 96
+#define _HighBoundary 128
+
+#include <sysdep.h>
+
+ .text
+ .section .text.avx2,"ax",@progbits
+ENTRY(_ZGVdN8vv_hypotf_avx2)
+ pushq %rbp
+ cfi_def_cfa_offset(16)
+ movq %rsp, %rbp
+ cfi_def_cfa(6, 16)
+ cfi_offset(6, -16)
+ andq $-32, %rsp
+ subq $128, %rsp
+
+/*
+ * Implementation
+ * Multiprecision branch for _HA_ only
+ * No multiprecision branch for _LA_
+ * _z = _VARG1 * _VARG1 + _VARG2 * _VARG2
+ */
+ vmulps %ymm0, %ymm0, %ymm8
+
+/*
+ * Variables
+ * Defines
+ * Constants loading
+ */
+ vmovups _sHalf+__svml_shypot_data_internal(%rip), %ymm7
+
+/* Check _z exponent to be within borders [1E3 ; 60A] else goto Callout */
+ vmovups _LowBoundary+__svml_shypot_data_internal(%rip), %ymm2
+ vfmadd231ps %ymm1, %ymm1, %ymm8
+
+/* _s0 ~ 1.0/sqrt(_z) */
+ vrsqrtps %ymm8, %ymm6
+ vpcmpgtd %ymm8, %ymm2, %ymm3
+
+/* First iteration */
+ vmulps %ymm8, %ymm6, %ymm9
+ vmulps %ymm7, %ymm6, %ymm2
+ vfnmadd231ps %ymm9, %ymm2, %ymm7
+ vfmadd213ps %ymm9, %ymm7, %ymm9
+
+/* Second iteration */
+ vfmadd132ps %ymm7, %ymm2, %ymm2
+ vpcmpgtd _HighBoundary+__svml_shypot_data_internal(%rip), %ymm8, %ymm4
+ vpor %ymm4, %ymm3, %ymm5
+
+/* Finish second iteration in native precision for _LA_ */
+ vfmsub231ps %ymm9, %ymm9, %ymm8
+ vmovmskps %ymm5, %edx
+ vfnmadd213ps %ymm9, %ymm8, %ymm2
+
+/* The end of implementation */
+ testl %edx, %edx
+
+/* Go to special inputs processing branch */
+ jne L(SPECIAL_VALUES_BRANCH)
+ # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 ymm2
+
+/* Restore registers
+ * and exit the function
+ */
+
+L(EXIT):
+ vmovaps %ymm2, %ymm0
+ movq %rbp, %rsp
+ popq %rbp
+ cfi_def_cfa(7, 8)
+ cfi_restore(6)
+ ret
+ cfi_def_cfa(6, 16)
+ cfi_offset(6, -16)
+
+/* Branch to process
+ * special inputs
+ */
+
+L(SPECIAL_VALUES_BRANCH):
+ vmovups %ymm0, 32(%rsp)
+ vmovups %ymm1, 64(%rsp)
+ vmovups %ymm2, 96(%rsp)
+ # LOE rbx r12 r13 r14 r15 edx ymm2
+
+ xorl %eax, %eax
+ # LOE rbx r12 r13 r14 r15 eax edx
+
+ vzeroupper
+ movq %r12, 16(%rsp)
+ /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */
+ .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22
+ movl %eax, %r12d
+ movq %r13, 8(%rsp)
+ /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */
+ .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22
+ movl %edx, %r13d
+ movq %r14, (%rsp)
+ /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */
+ .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22
+ # LOE rbx r15 r12d r13d
+
+/* Range mask
+ * bits check
+ */
+
+L(RANGEMASK_CHECK):
+ btl %r12d, %r13d
+
+/* Call scalar math function */
+ jc L(SCALAR_MATH_CALL)
+ # LOE rbx r15 r12d r13d
+
+/* Special inputs
+ * processing loop
+ */
+
+L(SPECIAL_VALUES_LOOP):
+ incl %r12d
+ cmpl $8, %r12d
+
+/* Check bits in range mask */
+ jl L(RANGEMASK_CHECK)
+ # LOE rbx r15 r12d r13d
+
+ movq 16(%rsp), %r12
+ cfi_restore(12)
+ movq 8(%rsp), %r13
+ cfi_restore(13)
+ movq (%rsp), %r14
+ cfi_restore(14)
+ vmovups 96(%rsp), %ymm2
+
+/* Go to exit */
+ jmp L(EXIT)
+ /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */
+ .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22
+ /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */
+ .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22
+ /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */
+ .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22
+ # LOE rbx r12 r13 r14 r15 ymm2
+
+/* Scalar math function call
+ * to process special input
+ */
+
+L(SCALAR_MATH_CALL):
+ movl %r12d, %r14d
+ movss 32(%rsp,%r14,4), %xmm0
+ movss 64(%rsp,%r14,4), %xmm1
+ call hypotf@PLT
+ # LOE rbx r14 r15 r12d r13d xmm0
+
+ movss %xmm0, 96(%rsp,%r14,4)
+
+/* Process special inputs in loop */
+ jmp L(SPECIAL_VALUES_LOOP)
+ # LOE rbx r15 r12d r13d
+END(_ZGVdN8vv_hypotf_avx2)
+
+ .section .rodata, "a"
+ .align 32
+
+#ifdef __svml_shypot_data_internal_typedef
+typedef unsigned int VUINT32;
+typedef struct
+{
+ __declspec(align(32)) VUINT32 _sHiLoMask[8][1];
+ __declspec(align(32)) VUINT32 _sAbsMask[8][1];
+ __declspec(align(32)) VUINT32 _sHalf[8][1];
+ __declspec(align(32)) VUINT32 _LowBoundary[8][1];
+ __declspec(align(32)) VUINT32 _HighBoundary[8][1];
+} __svml_shypot_data_internal;
+#endif
+__svml_shypot_data_internal:
+ /* legacy algorithm */
+ .long 0xFFF80000, 0xFFF80000, 0xFFF80000, 0xFFF80000, 0xFFF80000, 0xFFF80000, 0xFFF80000, 0xFFF80000 /* _sHiLoMask */
+ .align 32
+ .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _sAbsMask */
+ .align 32
+ .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sHalf */
+ .align 32
+ .long 0x1E300000, 0x1E300000, 0x1E300000, 0x1E300000, 0x1E300000, 0x1E300000, 0x1E300000, 0x1E300000 /* _LowBoundary */
+ .align 32
+ .long 0x60A00000, 0x60A00000, 0x60A00000, 0x60A00000, 0x60A00000, 0x60A00000, 0x60A00000, 0x60A00000 /* _HighBoundary */
+ .align 32
+ .type __svml_shypot_data_internal,@object
+ .size __svml_shypot_data_internal,.-__svml_shypot_data_internal
diff --git a/sysdeps/x86_64/fpu/svml_d_hypot2_core.S b/sysdeps/x86_64/fpu/svml_d_hypot2_core.S
new file mode 100644
index
0000000000..ea98f36324
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_hypot2_core.S
@@ -0,0 +1,29 @@
+/* Function hypot vectorized with SSE2.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVbN2vv_hypot)
+WRAPPER_IMPL_SSE2_ff hypot
+END (_ZGVbN2vv_hypot)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVbN2vv_hypot)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_hypot4_core.S b/sysdeps/x86_64/fpu/svml_d_hypot4_core.S
new file mode 100644
index 0000000000..cedbbff2b6
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_hypot4_core.S
@@ -0,0 +1,29 @@
+/* Function hypot vectorized with AVX2, wrapper version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVdN4vv_hypot)
+WRAPPER_IMPL_AVX_ff _ZGVbN2vv_hypot
+END (_ZGVdN4vv_hypot)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVdN4vv_hypot)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_hypot4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_hypot4_core_avx.S
new file mode 100644
index 0000000000..e0fef5203d
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_hypot4_core_avx.S
@@ -0,0 +1,25 @@
+/* Function hypot vectorized in AVX ISA as wrapper to SSE4 ISA version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVcN4vv_hypot)
+WRAPPER_IMPL_AVX_ff _ZGVbN2vv_hypot
+END (_ZGVcN4vv_hypot)
diff --git a/sysdeps/x86_64/fpu/svml_d_hypot8_core.S b/sysdeps/x86_64/fpu/svml_d_hypot8_core.S
new file mode 100644
index 0000000000..7588e4407b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_hypot8_core.S
@@ -0,0 +1,25 @@
+/* Function hypot vectorized with AVX-512. Wrapper to AVX2 version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVeN8vv_hypot)
+WRAPPER_IMPL_AVX512_ff _ZGVdN4vv_hypot
+END (_ZGVeN8vv_hypot)
diff --git a/sysdeps/x86_64/fpu/svml_s_hypotf16_core.S b/sysdeps/x86_64/fpu/svml_s_hypotf16_core.S
new file mode 100644
index 0000000000..06d421a926
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_hypotf16_core.S
@@ -0,0 +1,25 @@
+/* Function hypotf vectorized with AVX-512. Wrapper to AVX2 version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVeN16vv_hypotf)
+WRAPPER_IMPL_AVX512_ff _ZGVdN8vv_hypotf
+END (_ZGVeN16vv_hypotf)
diff --git a/sysdeps/x86_64/fpu/svml_s_hypotf4_core.S b/sysdeps/x86_64/fpu/svml_s_hypotf4_core.S
new file mode 100644
index 0000000000..7e8553cae4
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_hypotf4_core.S
@@ -0,0 +1,29 @@
+/* Function hypotf vectorized with SSE2.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVbN4vv_hypotf)
+WRAPPER_IMPL_SSE2_ff hypotf
+END (_ZGVbN4vv_hypotf)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVbN4vv_hypotf)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_hypotf8_core.S b/sysdeps/x86_64/fpu/svml_s_hypotf8_core.S
new file mode 100644
index 0000000000..a9bf27370b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_hypotf8_core.S
@@ -0,0 +1,29 @@
+/* Function hypotf vectorized with AVX2, wrapper version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+ .text
+ENTRY (_ZGVdN8vv_hypotf)
+WRAPPER_IMPL_AVX_ff _ZGVbN4vv_hypotf
+END (_ZGVdN8vv_hypotf)
+
+#ifndef USE_MULTIARCH
+ libmvec_hidden_def (_ZGVdN8vv_hypotf)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_hypotf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_hypotf8_core_avx.S
new file mode 100644
index 0000000000..8b8008a7e9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_hypotf8_core_avx.S
@@ -0,0 +1,25 @@
+/* Function hypotf vectorized in AVX ISA as wrapper to SSE4 ISA version.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+ .text
+ENTRY(_ZGVcN8vv_hypotf)
+WRAPPER_IMPL_AVX_ff _ZGVbN4vv_hypotf
+END(_ZGVcN8vv_hypotf)
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx.c
new file mode 100644
index 0000000000..c6a26a63e4
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-hypot.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx2.c
new file mode 100644
index 0000000000..c6a26a63e4
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx2.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-hypot.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx512f.c
new file mode 100644
index 0000000000..c6a26a63e4
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-hypot-avx512f.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-hypot.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-hypot.c b/sysdeps/x86_64/fpu/test-double-libmvec-hypot.c
new file mode 100644
index 0000000000..c0f600a443
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-hypot.c
@@ -0,0 +1,3 @@
+#define LIBMVEC_TYPE double
+#define LIBMVEC_FUNC hypot
+#include "test-vector-abi-arg2.h"
diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
index 5746bb5be3..9bc9d1dafa 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
@@ -30,6 +30,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVbN2vv_pow)
 VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVbN2v_acos)
 VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVbN2v_atan)
 VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVbN2v_asin)
+VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVbN2vv_hypot)

 #define VEC_INT_TYPE __m128i
diff --git
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
index 8d3d5493ed..c41994d90a 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
@@ -33,6 +33,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVdN4vv_pow)
 VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVdN4v_acos)
 VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVdN4v_atan)
 VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVdN4v_asin)
+VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVdN4vv_hypot)

 #ifndef __ILP32__
 # define VEC_INT_TYPE __m256i
diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c
index f43328f2ff..881f6c801a 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c
@@ -30,6 +30,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVcN4vv_pow)
 VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVcN4v_acos)
 VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVcN4v_atan)
 VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVcN4v_asin)
+VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVcN4vv_hypot)

 #define VEC_INT_TYPE __m128i
diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c
index 8b566c199a..6fd106fe68 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c
@@ -30,6 +30,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (pow), _ZGVeN8vv_pow)
 VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVeN8v_acos)
 VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVeN8v_atan)
 VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVeN8v_asin)
+VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVeN8vv_hypot)

 #ifndef __ILP32__
 # define VEC_INT_TYPE __m512i
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx.c
new file mode 100644
index 0000000000..97d11ad1d3
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx.c
@@ -0,0 +1 @@
+#include "test-float-libmvec-hypotf.c"
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx2.c
new file mode 100644
index 0000000000..97d11ad1d3
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx2.c
@@ -0,0 +1 @@
+#include "test-float-libmvec-hypotf.c"
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx512f.c
new file mode 100644
index 0000000000..97d11ad1d3
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf-avx512f.c
@@ -0,0 +1 @@
+#include "test-float-libmvec-hypotf.c"
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-hypotf.c b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf.c
new file mode 100644
index 0000000000..38776fa724
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-hypotf.c
@@ -0,0 +1,3 @@
+#define LIBMVEC_TYPE float
+#define LIBMVEC_FUNC hypotf
+#include "test-vector-abi-arg2.h"
diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c
index 3d3218a310..4c2ea6ddfe 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c
@@ -30,6 +30,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVeN16vv_powf)
 VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVeN16v_acosf)
 VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVeN16v_atanf)
 VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVeN16v_asinf)
+VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVeN16vv_hypotf)

 #define VEC_INT_TYPE __m512i
diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c
index 7d75b9f60f..1d5d952d07 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c
@@ -30,6 +30,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVbN4vv_powf)
 VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVbN4v_acosf)
 VECTOR_WRAPPER (WRAPPER_NAME (atanf),
_ZGVbN4v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVbN4v_asinf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVbN4vv_hypotf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 405dde49bc..7a750f3781 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVdN8vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVdN8v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVdN8v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVdN8v_asinf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVdN8vv_hypotf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 7558443f2e..af816a7789 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -30,6 +30,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (powf), _ZGVcN8vv_powf) VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVcN8v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVcN8v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVcN8v_asinf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVcN8vv_hypotf) #define VEC_INT_TYPE __m128i From patchwork Tue Dec 28 20:11:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573811 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: bilbo.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=xKLo6Yes; dkim-atps=neutral Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org 
To: libc-alpha@sourceware.org
Subject: [PATCH v4 04/18] x86-64: Add vector exp2/exp2f implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:16 -0800
Message-Id: <20211228201130.737370-5-skpgkp2@gmail.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
MIME-Version: 1.0
From: Sunil Pandey
Reply-To: Sunil K Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com

Implement vectorized exp2/exp2f for libmvec, with SSE, AVX, AVX2 and AVX512 versions, as required by the vector ABI. The patch also adds accuracy and ABI tests for vector exp2/exp2f, with regenerated ulps.
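For reference, the reduction used by the assembly below (described in its ALGORITHM DESCRIPTION comments: x = m/K + r with K = 128, exp2(x) = 2^n * T[j] * (1 + P(r))) can be sketched in scalar Python. This is an illustrative sketch only: `exp2_sketch` is a hypothetical name, and the polynomial uses plain Taylor coefficients of 2^r in place of the patch's tuned minimax constants (_dPC1.._dPC4).

```python
import math

K = 128                                  # table granularity, as in the patch
T = [2.0 ** (j / K) for j in range(K)]   # tabulated 2^(j/K), the _dbT table

# Degree-4 Taylor coefficients of 2^r - 1 = r*ln2 + r^2*ln2^2/2! + ...
# (stand-ins for the patch's minimax constants of similar magnitude).
LN2 = math.log(2.0)
C = [LN2, LN2**2 / 2.0, LN2**3 / 6.0, LN2**4 / 24.0]

def exp2_sketch(x):
    m = round(x * K)          # x = m/K + r, so |r| <= 1/(2K)
    r = x - m / K
    n, j = divmod(m, K)       # m = n*K + j with j in [0, K)
    p = r * (C[0] + r * (C[1] + r * (C[2] + r * C[3])))  # ~= 2^r - 1
    return math.ldexp(T[j] * (1.0 + p), n)  # 2^n * T[j] * (1 + P(r))
```

The vector code performs the same steps in parallel, roughly: adding `_dbShifter` captures the rounded `x*K` in the low mantissa bits, the bits masked by `_lIndexMask` index `_dbT`, and the remaining bits are shifted (`psllq`/`vpsllq` by 45) and added into the result's exponent field in place of `ldexp`; inputs outside the `_iDomainRange` bound fall back to the scalar `exp2` call.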
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_exp22_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_exp22_core.c | 27 ++ .../fpu/multiarch/svml_d_exp22_core_sse4.S | 325 +++++++++++++++++ .../fpu/multiarch/svml_d_exp24_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_exp24_core.c | 27 ++ .../fpu/multiarch/svml_d_exp24_core_avx2.S | 341 ++++++++++++++++++ .../fpu/multiarch/svml_d_exp28_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_exp28_core.c | 27 ++ .../fpu/multiarch/svml_d_exp28_core_avx512.S | 301 ++++++++++++++++ .../fpu/multiarch/svml_s_exp2f16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_exp2f16_core.c | 28 ++ .../multiarch/svml_s_exp2f16_core_avx512.S | 271 ++++++++++++++ .../fpu/multiarch/svml_s_exp2f4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_exp2f4_core.c | 28 ++ .../fpu/multiarch/svml_s_exp2f4_core_sse4.S | 238 ++++++++++++ .../fpu/multiarch/svml_s_exp2f8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_exp2f8_core.c | 28 ++ .../fpu/multiarch/svml_s_exp2f8_core_avx2.S | 245 +++++++++++++ sysdeps/x86_64/fpu/svml_d_exp22_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_exp24_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_exp24_core_avx.S | 25 ++ sysdeps/x86_64/fpu/svml_d_exp28_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_exp2f16_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_exp2f4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_exp2f8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_exp2f8_core_avx.S | 25 ++ .../x86_64/fpu/test-double-libmvec-exp2-avx.c | 1 + .../fpu/test-double-libmvec-exp2-avx2.c | 1 + .../fpu/test-double-libmvec-exp2-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-exp2.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-exp2f-avx.c | 1 + .../fpu/test-float-libmvec-exp2f-avx2.c | 1 + .../fpu/test-float-libmvec-exp2f-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-exp2f.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2293 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_exp22_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_exp24_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_exp24_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_exp28_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp2f16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp2f4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp2f8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp2f8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp2f.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index adf65f6bc2..36d6643eb9 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -142,4 +142,15 @@ #define __DECL_SIMD_hypotf32x #define __DECL_SIMD_hypotf64x #define __DECL_SIMD_hypotf128x + +#define __DECL_SIMD_exp2 +#define __DECL_SIMD_exp2f +#define __DECL_SIMD_exp2l +#define __DECL_SIMD_exp2f16 +#define __DECL_SIMD_exp2f32 +#define __DECL_SIMD_exp2f64 +#define __DECL_SIMD_exp2f128 +#define __DECL_SIMD_exp2f32x +#define __DECL_SIMD_exp2f64x +#define __DECL_SIMD_exp2f128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 2ed820a0dc..645088cbf3 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -127,7 +127,7 @@ __MATHCALL (logb,, (_Mdouble_ __x)); #ifdef __USE_ISOC99 /* Compute base-2 exponential of X. */ -__MATHCALL (exp2,, (_Mdouble_ __x)); +__MATHCALL_VEC (exp2,, (_Mdouble_ __x)); /* Compute base-2 logarithm of X. 
*/ __MATHCALL (log2,, (_Mdouble_ __x)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 12bb03245b..1717f2dee9 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,32 +49,40 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F +GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F +GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F +GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F +GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F +GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F +GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F +GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F +GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 437977c5fd..c7a972521b 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -74,6 +74,10 @@ # define __DECL_SIMD_hypot __DECL_SIMD_x86_64 # undef __DECL_SIMD_hypotf # define __DECL_SIMD_hypotf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_exp2 +# define __DECL_SIMD_exp2 __DECL_SIMD_x86_64 +# undef __DECL_SIMD_exp2f +# define __DECL_SIMD_exp2f __DECL_SIMD_x86_64 # endif #endif diff --git 
a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index cda31479a6..0994e6dfac 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -36,6 +36,8 @@ !GCC$ builtin (asinf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (hypot) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (hypotf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (exp2) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (exp2f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -57,3 +59,5 @@ !GCC$ builtin (asinf) attributes simd (notinbranch) if('x32') !GCC$ builtin (hypot) attributes simd (notinbranch) if('x32') !GCC$ builtin (hypotf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (exp2) attributes simd (notinbranch) if('x32') +!GCC$ builtin (exp2f) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 7769a02731..03b2364417 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -27,6 +27,7 @@ libmvec-funcs = \ atan \ cos \ exp \ + exp2 \ hypot \ log \ pow \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index e359e5dc2c..12b7ad1830 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,10 +17,12 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; + _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; 
_ZGVeN16v_atanf; + _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index a7513ec94e..bc4479ad39 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1276,6 +1276,26 @@ float: 1 float128: 2 ldouble: 1 +Function: "exp2_vlen16": +float: 1 + +Function: "exp2_vlen2": +double: 1 + +Function: "exp2_vlen4": +double: 1 +float: 1 + +Function: "exp2_vlen4_avx2": +double: 1 + +Function: "exp2_vlen8": +double: 1 +float: 1 + +Function: "exp2_vlen8_avx2": +float: 1 + Function: "exp_downward": double: 1 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core-sse2.S new file mode 100644 index 0000000000..330260baaa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized exp2, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define _ZGVbN2v_exp2 _ZGVbN2v_exp2_sse2 +#include "../svml_d_exp22_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core.c new file mode 100644 index 0000000000..e0cf198030 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized exp2, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN2v_exp2 +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_exp2, __GI__ZGVbN2v_exp2, __redirect__ZGVbN2v_exp2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core_sse4.S new file mode 100644 index 0000000000..7388c242f6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp22_core_sse4.S @@ -0,0 +1,325 @@ +/* Function exp2 vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * exp2(x) = 2^n * T[j] * (1 + P(y)) + * where + * x = m*(1/K) + y, y in [-1/K..1/K] + * m = n*K + j, m,n,j - signed integer, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp2(x)-1 + * on small interval [-1/K..1/K] + * + * Special cases: + * + * exp2(NaN) = NaN + * exp2(+INF) = +INF + * exp2(-INF) = 0 + * exp2(x) = 1 for subnormals + * For IEEE double + * if x >= 1024.0 then exp2(x) overflows + * if x < -1076.0 then exp2(x) underflows + * + */ + +/* Offsets for data table __svml_dexp2_data_internal + */ +#define _dbT 0 +#define _dbShifter 1024 +#define _dPC1 1040 +#define _dPC2 1056 +#define _dPC3 1072 +#define _dPC4 1088 +#define _lIndexMask 1104 +#define _iAbsMask 1120 +#define _iDomainRange 1136 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_exp2_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* R */ + movaps %xmm0, %xmm7 + movups _dbShifter+__svml_dexp2_data_internal(%rip), %xmm1 + +/* out, basePtr, iIndex, iBaseOfs, iSize, iGran, iOfs */ + lea __svml_dexp2_data_internal(%rip), %rsi + +/* Load argument */ + movaps %xmm1, %xmm10 + addpd %xmm0, %xmm10 + movaps %xmm10, %xmm6 + subpd %xmm1, %xmm6 + subpd %xmm6, %xmm7 + +/* + * Polynomial + * poly(dN) = a1*dR+...+a4*dR^4 + 
*/ + movups _dPC4+__svml_dexp2_data_internal(%rip), %xmm8 + mulpd %xmm7, %xmm8 + addpd _dPC3+__svml_dexp2_data_internal(%rip), %xmm8 + mulpd %xmm7, %xmm8 + addpd _dPC2+__svml_dexp2_data_internal(%rip), %xmm8 + movdqu _lIndexMask+__svml_dexp2_data_internal(%rip), %xmm9 + +/* Index and lookup */ + movdqa %xmm9, %xmm5 + pandn %xmm10, %xmm9 + pand %xmm10, %xmm5 + +/* 2^N */ + psllq $45, %xmm9 + movd %xmm5, %eax + movq _iAbsMask+__svml_dexp2_data_internal(%rip), %xmm2 + +/* Check for overflow/underflow */ + pshufd $221, %xmm0, %xmm4 + pextrw $4, %xmm5, %ecx + +/* a1+...+a4*dR^3 ! */ + mulpd %xmm7, %xmm8 + shll $3, %eax + pand %xmm2, %xmm4 + shll $3, %ecx + movq (%rsi,%rax), %xmm1 + movhpd (%rsi,%rcx), %xmm1 + +/* dR=dR*dT */ + mulpd %xmm1, %xmm7 + addpd _dPC1+__svml_dexp2_data_internal(%rip), %xmm8 + +/* + * Reconstruction + * exp2 = {2^N later}*(Tj+Tj*poly) + * dN = dT+dT*dR*(a1+...+a4*dR^3) + */ + mulpd %xmm7, %xmm8 + addpd %xmm8, %xmm1 + movq _iDomainRange+__svml_dexp2_data_internal(%rip), %xmm3 + pcmpgtd %xmm3, %xmm4 + movmskps %xmm4, %edx + +/* quick 2^N */ + paddq %xmm9, %xmm1 + andl $3, %edx + +/* Finish */ + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm1, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx xmm1 + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * 
processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm1 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call exp2@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_exp2_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dexp2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _dbT[(1<<7)][2]; + __declspec(align(16)) VUINT32 _dbShifter[2][2]; + __declspec(align(16)) VUINT32 _dPC1[2][2]; + __declspec(align(16)) VUINT32 _dPC2[2][2]; + __declspec(align(16)) VUINT32 _dPC3[2][2]; + __declspec(align(16)) VUINT32 _dPC4[2][2]; + __declspec(align(16)) VUINT32 _lIndexMask[2][2]; + __declspec(align(16)) VUINT32 _iAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; +} __svml_dexp2_data_internal; +#endif +__svml_dexp2_data_internal: + /*== _dbT ==*/ + .quad 0x3ff0000000000000, 0x3ff0163da9fb3335 /*2^( 0 /128),2^( 1 /128)*/ + .quad 0x3ff02c9a3e778061, 0x3ff04315e86e7f85 /*2^( 2 /128),2^( 3 /128)*/ + .quad 0x3ff059b0d3158574, 0x3ff0706b29ddf6de /*2^( 4 /128),2^( 5 /128)*/ + .quad 0x3ff0874518759bc8, 0x3ff09e3ecac6f383 /*2^( 6 /128),2^( 7 /128)*/ + .quad 0x3ff0b5586cf9890f, 0x3ff0cc922b7247f7 /*2^( 8 /128),2^( 9 /128)*/ + .quad 0x3ff0e3ec32d3d1a2, 0x3ff0fb66affed31b /*2^( 10 /128),2^( 11 /128)*/ + .quad 0x3ff11301d0125b51, 0x3ff12abdc06c31cc /*2^( 12 /128),2^( 13 /128)*/ + .quad 
0x3ff1429aaea92de0, 0x3ff15a98c8a58e51 /*2^( 14 /128),2^( 15 /128)*/ + .quad 0x3ff172b83c7d517b, 0x3ff18af9388c8dea /*2^( 16 /128),2^( 17 /128)*/ + .quad 0x3ff1a35beb6fcb75, 0x3ff1bbe084045cd4 /*2^( 18 /128),2^( 19 /128)*/ + .quad 0x3ff1d4873168b9aa, 0x3ff1ed5022fcd91d /*2^( 20 /128),2^( 21 /128)*/ + .quad 0x3ff2063b88628cd6, 0x3ff21f49917ddc96 /*2^( 22 /128),2^( 23 /128)*/ + .quad 0x3ff2387a6e756238, 0x3ff251ce4fb2a63f /*2^( 24 /128),2^( 25 /128)*/ + .quad 0x3ff26b4565e27cdd, 0x3ff284dfe1f56381 /*2^( 26 /128),2^( 27 /128)*/ + .quad 0x3ff29e9df51fdee1, 0x3ff2b87fd0dad990 /*2^( 28 /128),2^( 29 /128)*/ + .quad 0x3ff2d285a6e4030b, 0x3ff2ecafa93e2f56 /*2^( 30 /128),2^( 31 /128)*/ + .quad 0x3ff306fe0a31b715, 0x3ff32170fc4cd831 /*2^( 32 /128),2^( 33 /128)*/ + .quad 0x3ff33c08b26416ff, 0x3ff356c55f929ff1 /*2^( 34 /128),2^( 35 /128)*/ + .quad 0x3ff371a7373aa9cb, 0x3ff38cae6d05d866 /*2^( 36 /128),2^( 37 /128)*/ + .quad 0x3ff3a7db34e59ff7, 0x3ff3c32dc313a8e5 /*2^( 38 /128),2^( 39 /128)*/ + .quad 0x3ff3dea64c123422, 0x3ff3fa4504ac801c /*2^( 40 /128),2^( 41 /128)*/ + .quad 0x3ff4160a21f72e2a, 0x3ff431f5d950a897 /*2^( 42 /128),2^( 43 /128)*/ + .quad 0x3ff44e086061892d, 0x3ff46a41ed1d0057 /*2^( 44 /128),2^( 45 /128)*/ + .quad 0x3ff486a2b5c13cd0, 0x3ff4a32af0d7d3de /*2^( 46 /128),2^( 47 /128)*/ + .quad 0x3ff4bfdad5362a27, 0x3ff4dcb299fddd0d /*2^( 48 /128),2^( 49 /128)*/ + .quad 0x3ff4f9b2769d2ca7, 0x3ff516daa2cf6642 /*2^( 50 /128),2^( 51 /128)*/ + .quad 0x3ff5342b569d4f82, 0x3ff551a4ca5d920f /*2^( 52 /128),2^( 53 /128)*/ + .quad 0x3ff56f4736b527da, 0x3ff58d12d497c7fd /*2^( 54 /128),2^( 55 /128)*/ + .quad 0x3ff5ab07dd485429, 0x3ff5c9268a5946b7 /*2^( 56 /128),2^( 57 /128)*/ + .quad 0x3ff5e76f15ad2148, 0x3ff605e1b976dc09 /*2^( 58 /128),2^( 59 /128)*/ + .quad 0x3ff6247eb03a5585, 0x3ff6434634ccc320 /*2^( 60 /128),2^( 61 /128)*/ + .quad 0x3ff6623882552225, 0x3ff68155d44ca973 /*2^( 62 /128),2^( 63 /128)*/ + .quad 0x3ff6a09e667f3bcd, 0x3ff6c012750bdabf /*2^( 64 /128),2^( 65 /128)*/ + 
.quad 0x3ff6dfb23c651a2f, 0x3ff6ff7df9519484 /*2^( 66 /128),2^( 67 /128)*/ + .quad 0x3ff71f75e8ec5f74, 0x3ff73f9a48a58174 /*2^( 68 /128),2^( 69 /128)*/ + .quad 0x3ff75feb564267c9, 0x3ff780694fde5d3f /*2^( 70 /128),2^( 71 /128)*/ + .quad 0x3ff7a11473eb0187, 0x3ff7c1ed0130c132 /*2^( 72 /128),2^( 73 /128)*/ + .quad 0x3ff7e2f336cf4e62, 0x3ff80427543e1a12 /*2^( 74 /128),2^( 75 /128)*/ + .quad 0x3ff82589994cce13, 0x3ff8471a4623c7ad /*2^( 76 /128),2^( 77 /128)*/ + .quad 0x3ff868d99b4492ed, 0x3ff88ac7d98a6699 /*2^( 78 /128),2^( 79 /128)*/ + .quad 0x3ff8ace5422aa0db, 0x3ff8cf3216b5448c /*2^( 80 /128),2^( 81 /128)*/ + .quad 0x3ff8f1ae99157736, 0x3ff9145b0b91ffc6 /*2^( 82 /128),2^( 83 /128)*/ + .quad 0x3ff93737b0cdc5e5, 0x3ff95a44cbc8520f /*2^( 84 /128),2^( 85 /128)*/ + .quad 0x3ff97d829fde4e50, 0x3ff9a0f170ca07ba /*2^( 86 /128),2^( 87 /128)*/ + .quad 0x3ff9c49182a3f090, 0x3ff9e86319e32323 /*2^( 88 /128),2^( 89 /128)*/ + .quad 0x3ffa0c667b5de565, 0x3ffa309bec4a2d33 /*2^( 90 /128),2^( 91 /128)*/ + .quad 0x3ffa5503b23e255d, 0x3ffa799e1330b358 /*2^( 92 /128),2^( 93 /128)*/ + .quad 0x3ffa9e6b5579fdbf, 0x3ffac36bbfd3f37a /*2^( 94 /128),2^( 95 /128)*/ + .quad 0x3ffae89f995ad3ad, 0x3ffb0e07298db666 /*2^( 96 /128),2^( 97 /128)*/ + .quad 0x3ffb33a2b84f15fb, 0x3ffb59728de5593a /*2^( 98 /128),2^( 99 /128)*/ + .quad 0x3ffb7f76f2fb5e47, 0x3ffba5b030a1064a /*2^( 100 /128),2^( 101 /128)*/ + .quad 0x3ffbcc1e904bc1d2, 0x3ffbf2c25bd71e09 /*2^( 102 /128),2^( 103 /128)*/ + .quad 0x3ffc199bdd85529c, 0x3ffc40ab5fffd07a /*2^( 104 /128),2^( 105 /128)*/ + .quad 0x3ffc67f12e57d14b, 0x3ffc8f6d9406e7b5 /*2^( 106 /128),2^( 107 /128)*/ + .quad 0x3ffcb720dcef9069, 0x3ffcdf0b555dc3fa /*2^( 108 /128),2^( 109 /128)*/ + .quad 0x3ffd072d4a07897c, 0x3ffd2f87080d89f2 /*2^( 110 /128),2^( 111 /128)*/ + .quad 0x3ffd5818dcfba487, 0x3ffd80e316c98398 /*2^( 112 /128),2^( 113 /128)*/ + .quad 0x3ffda9e603db3285, 0x3ffdd321f301b460 /*2^( 114 /128),2^( 115 /128)*/ + .quad 0x3ffdfc97337b9b5f, 0x3ffe264614f5a129 /*2^( 116 
/128),2^( 117 /128)*/ + .quad 0x3ffe502ee78b3ff6, 0x3ffe7a51fbc74c83 /*2^( 118 /128),2^( 119 /128)*/ + .quad 0x3ffea4afa2a490da, 0x3ffecf482d8e67f1 /*2^( 120 /128),2^( 121 /128)*/ + .quad 0x3ffefa1bee615a27, 0x3fff252b376bba97 /*2^( 122 /128),2^( 123 /128)*/ + .quad 0x3fff50765b6e4540, 0x3fff7bfdad9cbe14 /*2^( 124 /128),2^( 125 /128)*/ + .quad 0x3fffa7c1819e90d8, 0x3fffd3c22b8f71f1 /*2^( 126 /128),2^( 127 /128)*/ + .align 16 + .quad 0x42c8000000000000, 0x42c8000000000000 /* _dbShifter - 0x433-7=0x42c shifted right on K!*/ + //log2(relerr) = -53.547756365162 + .align 16 + .quad 0x3fe62e42fefa3685, 0x3fe62e42fefa3685 /* _dPC1 */ + .align 16 + .quad 0x3fcebfbdff82ca48, 0x3fcebfbdff82ca48 /* _dPC2 */ + .align 16 + .quad 0x3fac6b09b180f045, 0x3fac6b09b180f045 /* _dPC3 */ + .align 16 + .quad 0x3f83b2ab5bb1268f, 0x3f83b2ab5bb1268f /* _dPC4 */ + .align 16 + .quad 0x000000000000007f, 0x000000000000007f /* _lIndexMask =(2^K-1)*/ + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 16 + .long 0x408fefff, 0x408fefff, 0x408fefff, 0x408fefff /* _iDomainRange */ + .align 16 + .type __svml_dexp2_data_internal,@object + .size __svml_dexp2_data_internal,.-__svml_dexp2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core-sse.S new file mode 100644 index 0000000000..51c5de1100 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized exp2, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN4v_exp2 _ZGVdN4v_exp2_sse_wrapper +#include "../svml_d_exp24_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core.c new file mode 100644 index 0000000000..bb979afde6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized exp2, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN4v_exp2 +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_exp2, __GI__ZGVdN4v_exp2, __redirect__ZGVdN4v_exp2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core_avx2.S new file mode 100644 index 0000000000..6aaadafeeb --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp24_core_avx2.S @@ -0,0 +1,341 @@ +/* Function exp2 vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * exp2(x) = 2^n * T[j] * (1 + P(y)) + * where + * x = m*(1/K) + y, y in [-1/K..1/K] + * m = n*K + j, m,n,j - signed integer, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp2(x)-1 + * on small interval [-1/K..1/K] + * + * Special cases: + * + * exp2(NaN) = NaN + * exp2(+INF) = +INF + * exp2(-INF) = 0 + * exp2(x) = 1 for subnormals + * For IEEE double + * if x >= 1024.0 then exp2(x) overflows + * if x < -1076.0 then exp2(x) underflows + * + */ + +/* Offsets for data table __svml_dexp2_data_internal + */ +#define _dbT 0 +#define _dbShifter 1024 +#define _dPC1 1056 +#define _dPC2 1088 +#define _dPC3 1120 +#define _dPC4 1152 +#define _lIndexMask 1184 +#define _iAbsMask 1216 +#define _iDomainRange 1248 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_exp2_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* out, basePtr, iIndex, iBaseOfs, iSize, iGran, iOfs */ + lea __svml_dexp2_data_internal(%rip), %r8 + vmovupd _dbShifter+__svml_dexp2_data_internal(%rip), %ymm4 + vmovupd _lIndexMask+__svml_dexp2_data_internal(%rip), %ymm3 + vmovapd %ymm0, %ymm1 + +/* Load argument */ + vaddpd %ymm4, %ymm1, %ymm2 + vsubpd %ymm4, %ymm2, %ymm0 + +/* Index and lookup */ + vandps %ymm3, %ymm2, %ymm9 + vpandn %ymm2, %ymm3, %ymm2 + +/* 2^N */ + vpsllq $45, %ymm2, %ymm3 + +/* R */ + vsubpd %ymm0, %ymm1, %ymm15 + +/* Check for overflow/underflow */ + vextractf128 $1, %ymm1, %xmm5 + +/* + * Polynomial + * poly(dN) = a1*dR+...+a4*dR^4 + */ + vmovupd _dPC4+__svml_dexp2_data_internal(%rip), %ymm0 + vshufps $221, %xmm5, %xmm1, %xmm6 + vandps _iAbsMask+__svml_dexp2_data_internal(%rip), %xmm6, %xmm7 + vpcmpgtd _iDomainRange+__svml_dexp2_data_internal(%rip), %xmm7, %xmm8 + vfmadd213pd _dPC3+__svml_dexp2_data_internal(%rip), %ymm15, %ymm0 + vmovmskps %xmm8, %eax + vfmadd213pd 
_dPC2+__svml_dexp2_data_internal(%rip), %ymm15, %ymm0 + +/* a1+...+a4*dR^3 ! */ + vfmadd213pd _dPC1+__svml_dexp2_data_internal(%rip), %ymm15, %ymm0 + vextractf128 $1, %ymm9, %xmm12 + vmovd %xmm9, %edx + vmovd %xmm12, %esi + shll $3, %edx + vpextrd $2, %xmm9, %ecx + shll $3, %esi + vpextrd $2, %xmm12, %edi + shll $3, %ecx + vmovq (%r8,%rdx), %xmm10 + shll $3, %edi + vmovq (%r8,%rsi), %xmm13 + vmovhpd (%r8,%rcx), %xmm10, %xmm11 + vmovhpd (%r8,%rdi), %xmm13, %xmm14 + vinsertf128 $1, %xmm14, %ymm11, %ymm4 + +/* dR=dR*dT */ + vmulpd %ymm15, %ymm4, %ymm15 + +/* + * Reconstruction + * exp2 = {2^N later}*(Tj+Tj*poly) + * dN = dT+dT*dR*(a1+...+a4*dR^3) + */ + vfmadd213pd %ymm4, %ymm15, %ymm0 + +/* quick 2^N */ + vpaddq %ymm3, %ymm0, %ymm0 + +/* Finish */ + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm1, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; 
DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call exp2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_exp2_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dexp2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _dbT[(1<<7)][2]; + 
__declspec(align(32)) VUINT32 _dbShifter[4][2]; + __declspec(align(32)) VUINT32 _dPC1[4][2]; + __declspec(align(32)) VUINT32 _dPC2[4][2]; + __declspec(align(32)) VUINT32 _dPC3[4][2]; + __declspec(align(32)) VUINT32 _dPC4[4][2]; + __declspec(align(32)) VUINT32 _lIndexMask[4][2]; + __declspec(align(32)) VUINT32 _iAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; +} __svml_dexp2_data_internal; +#endif +__svml_dexp2_data_internal: + /*== _dbT ==*/ + .quad 0x3ff0000000000000, 0x3ff0163da9fb3335 /*2^( 0 /128),2^( 1 /128)*/ + .quad 0x3ff02c9a3e778061, 0x3ff04315e86e7f85 /*2^( 2 /128),2^( 3 /128)*/ + .quad 0x3ff059b0d3158574, 0x3ff0706b29ddf6de /*2^( 4 /128),2^( 5 /128)*/ + .quad 0x3ff0874518759bc8, 0x3ff09e3ecac6f383 /*2^( 6 /128),2^( 7 /128)*/ + .quad 0x3ff0b5586cf9890f, 0x3ff0cc922b7247f7 /*2^( 8 /128),2^( 9 /128)*/ + .quad 0x3ff0e3ec32d3d1a2, 0x3ff0fb66affed31b /*2^( 10 /128),2^( 11 /128)*/ + .quad 0x3ff11301d0125b51, 0x3ff12abdc06c31cc /*2^( 12 /128),2^( 13 /128)*/ + .quad 0x3ff1429aaea92de0, 0x3ff15a98c8a58e51 /*2^( 14 /128),2^( 15 /128)*/ + .quad 0x3ff172b83c7d517b, 0x3ff18af9388c8dea /*2^( 16 /128),2^( 17 /128)*/ + .quad 0x3ff1a35beb6fcb75, 0x3ff1bbe084045cd4 /*2^( 18 /128),2^( 19 /128)*/ + .quad 0x3ff1d4873168b9aa, 0x3ff1ed5022fcd91d /*2^( 20 /128),2^( 21 /128)*/ + .quad 0x3ff2063b88628cd6, 0x3ff21f49917ddc96 /*2^( 22 /128),2^( 23 /128)*/ + .quad 0x3ff2387a6e756238, 0x3ff251ce4fb2a63f /*2^( 24 /128),2^( 25 /128)*/ + .quad 0x3ff26b4565e27cdd, 0x3ff284dfe1f56381 /*2^( 26 /128),2^( 27 /128)*/ + .quad 0x3ff29e9df51fdee1, 0x3ff2b87fd0dad990 /*2^( 28 /128),2^( 29 /128)*/ + .quad 0x3ff2d285a6e4030b, 0x3ff2ecafa93e2f56 /*2^( 30 /128),2^( 31 /128)*/ + .quad 0x3ff306fe0a31b715, 0x3ff32170fc4cd831 /*2^( 32 /128),2^( 33 /128)*/ + .quad 0x3ff33c08b26416ff, 0x3ff356c55f929ff1 /*2^( 34 /128),2^( 35 /128)*/ + .quad 0x3ff371a7373aa9cb, 0x3ff38cae6d05d866 /*2^( 36 /128),2^( 37 /128)*/ + .quad 0x3ff3a7db34e59ff7, 0x3ff3c32dc313a8e5 /*2^( 38 /128),2^( 39 /128)*/ + 
.quad 0x3ff3dea64c123422, 0x3ff3fa4504ac801c /*2^( 40 /128),2^( 41 /128)*/ + .quad 0x3ff4160a21f72e2a, 0x3ff431f5d950a897 /*2^( 42 /128),2^( 43 /128)*/ + .quad 0x3ff44e086061892d, 0x3ff46a41ed1d0057 /*2^( 44 /128),2^( 45 /128)*/ + .quad 0x3ff486a2b5c13cd0, 0x3ff4a32af0d7d3de /*2^( 46 /128),2^( 47 /128)*/ + .quad 0x3ff4bfdad5362a27, 0x3ff4dcb299fddd0d /*2^( 48 /128),2^( 49 /128)*/ + .quad 0x3ff4f9b2769d2ca7, 0x3ff516daa2cf6642 /*2^( 50 /128),2^( 51 /128)*/ + .quad 0x3ff5342b569d4f82, 0x3ff551a4ca5d920f /*2^( 52 /128),2^( 53 /128)*/ + .quad 0x3ff56f4736b527da, 0x3ff58d12d497c7fd /*2^( 54 /128),2^( 55 /128)*/ + .quad 0x3ff5ab07dd485429, 0x3ff5c9268a5946b7 /*2^( 56 /128),2^( 57 /128)*/ + .quad 0x3ff5e76f15ad2148, 0x3ff605e1b976dc09 /*2^( 58 /128),2^( 59 /128)*/ + .quad 0x3ff6247eb03a5585, 0x3ff6434634ccc320 /*2^( 60 /128),2^( 61 /128)*/ + .quad 0x3ff6623882552225, 0x3ff68155d44ca973 /*2^( 62 /128),2^( 63 /128)*/ + .quad 0x3ff6a09e667f3bcd, 0x3ff6c012750bdabf /*2^( 64 /128),2^( 65 /128)*/ + .quad 0x3ff6dfb23c651a2f, 0x3ff6ff7df9519484 /*2^( 66 /128),2^( 67 /128)*/ + .quad 0x3ff71f75e8ec5f74, 0x3ff73f9a48a58174 /*2^( 68 /128),2^( 69 /128)*/ + .quad 0x3ff75feb564267c9, 0x3ff780694fde5d3f /*2^( 70 /128),2^( 71 /128)*/ + .quad 0x3ff7a11473eb0187, 0x3ff7c1ed0130c132 /*2^( 72 /128),2^( 73 /128)*/ + .quad 0x3ff7e2f336cf4e62, 0x3ff80427543e1a12 /*2^( 74 /128),2^( 75 /128)*/ + .quad 0x3ff82589994cce13, 0x3ff8471a4623c7ad /*2^( 76 /128),2^( 77 /128)*/ + .quad 0x3ff868d99b4492ed, 0x3ff88ac7d98a6699 /*2^( 78 /128),2^( 79 /128)*/ + .quad 0x3ff8ace5422aa0db, 0x3ff8cf3216b5448c /*2^( 80 /128),2^( 81 /128)*/ + .quad 0x3ff8f1ae99157736, 0x3ff9145b0b91ffc6 /*2^( 82 /128),2^( 83 /128)*/ + .quad 0x3ff93737b0cdc5e5, 0x3ff95a44cbc8520f /*2^( 84 /128),2^( 85 /128)*/ + .quad 0x3ff97d829fde4e50, 0x3ff9a0f170ca07ba /*2^( 86 /128),2^( 87 /128)*/ + .quad 0x3ff9c49182a3f090, 0x3ff9e86319e32323 /*2^( 88 /128),2^( 89 /128)*/ + .quad 0x3ffa0c667b5de565, 0x3ffa309bec4a2d33 /*2^( 90 /128),2^( 91 /128)*/ 
+ .quad 0x3ffa5503b23e255d, 0x3ffa799e1330b358 /*2^( 92 /128),2^( 93 /128)*/ + .quad 0x3ffa9e6b5579fdbf, 0x3ffac36bbfd3f37a /*2^( 94 /128),2^( 95 /128)*/ + .quad 0x3ffae89f995ad3ad, 0x3ffb0e07298db666 /*2^( 96 /128),2^( 97 /128)*/ + .quad 0x3ffb33a2b84f15fb, 0x3ffb59728de5593a /*2^( 98 /128),2^( 99 /128)*/ + .quad 0x3ffb7f76f2fb5e47, 0x3ffba5b030a1064a /*2^( 100 /128),2^( 101 /128)*/ + .quad 0x3ffbcc1e904bc1d2, 0x3ffbf2c25bd71e09 /*2^( 102 /128),2^( 103 /128)*/ + .quad 0x3ffc199bdd85529c, 0x3ffc40ab5fffd07a /*2^( 104 /128),2^( 105 /128)*/ + .quad 0x3ffc67f12e57d14b, 0x3ffc8f6d9406e7b5 /*2^( 106 /128),2^( 107 /128)*/ + .quad 0x3ffcb720dcef9069, 0x3ffcdf0b555dc3fa /*2^( 108 /128),2^( 109 /128)*/ + .quad 0x3ffd072d4a07897c, 0x3ffd2f87080d89f2 /*2^( 110 /128),2^( 111 /128)*/ + .quad 0x3ffd5818dcfba487, 0x3ffd80e316c98398 /*2^( 112 /128),2^( 113 /128)*/ + .quad 0x3ffda9e603db3285, 0x3ffdd321f301b460 /*2^( 114 /128),2^( 115 /128)*/ + .quad 0x3ffdfc97337b9b5f, 0x3ffe264614f5a129 /*2^( 116 /128),2^( 117 /128)*/ + .quad 0x3ffe502ee78b3ff6, 0x3ffe7a51fbc74c83 /*2^( 118 /128),2^( 119 /128)*/ + .quad 0x3ffea4afa2a490da, 0x3ffecf482d8e67f1 /*2^( 120 /128),2^( 121 /128)*/ + .quad 0x3ffefa1bee615a27, 0x3fff252b376bba97 /*2^( 122 /128),2^( 123 /128)*/ + .quad 0x3fff50765b6e4540, 0x3fff7bfdad9cbe14 /*2^( 124 /128),2^( 125 /128)*/ + .quad 0x3fffa7c1819e90d8, 0x3fffd3c22b8f71f1 /*2^( 126 /128),2^( 127 /128)*/ + .align 32 + .quad 0x42c8000000000000, 0x42c8000000000000, 0x42c8000000000000, 0x42c8000000000000 /* _dbShifter - 0x433-7=0x42c shifted right on K!*/ + //log2(relerr) = -53.547756365162 + .align 32 + .quad 0x3fe62e42fefa3685, 0x3fe62e42fefa3685, 0x3fe62e42fefa3685, 0x3fe62e42fefa3685 /* _dPC1 */ + .align 32 + .quad 0x3fcebfbdff82ca48, 0x3fcebfbdff82ca48, 0x3fcebfbdff82ca48, 0x3fcebfbdff82ca48 /* _dPC2 */ + .align 32 + .quad 0x3fac6b09b180f045, 0x3fac6b09b180f045, 0x3fac6b09b180f045, 0x3fac6b09b180f045 /* _dPC3 */ + .align 32 + .quad 0x3f83b2ab5bb1268f, 0x3f83b2ab5bb1268f, 
0x3f83b2ab5bb1268f, 0x3f83b2ab5bb1268f /* _dPC4 */ + .align 32 + .quad 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f /* _lIndexMask =(2^K-1)*/ + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 32 + .long 0x408fefff, 0x408fefff, 0x408fefff, 0x408fefff, 0x408fefff, 0x408fefff, 0x408fefff, 0x408fefff /* _iDomainRange */ + .align 32 + .type __svml_dexp2_data_internal,@object + .size __svml_dexp2_data_internal,.-__svml_dexp2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core-avx2.S new file mode 100644 index 0000000000..c9c17f0aaa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized exp2, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#define _ZGVeN8v_exp2 _ZGVeN8v_exp2_avx2_wrapper +#include "../svml_d_exp28_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core.c new file mode 100644 index 0000000000..3be9e88e98 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized exp2, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVeN8v_exp2 +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_exp2, __GI__ZGVeN8v_exp2, __redirect__ZGVeN8v_exp2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core_avx512.S new file mode 100644 index 0000000000..90f21695f0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp28_core_avx512.S @@ -0,0 +1,301 @@ +/* Function exp2 vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Double precision mantissa represented as: 1.b1b2b3 ... b52 + * Constant for double precision: S = 2^48 x 1.5 + * + * 2^X = 2^Xo x 2^{X-Xo} + * 2^X = 2^K x 2^fo x 2^{X-Xo} + * 2^X = 2^K x 2^fo x 2^r + * + * 2^K --> Manual scaling + * 2^fo --> Table lookup + * r --> 1 + poly (r = X - Xo) + * + * Xo = K + fo + * Xo = K + 0.x1x2x3x4 + * + * r = X - Xo + * = Vreduce(X, imm) + * = X - VRndScale(X, imm), where Xo = VRndScale(X, imm) + * + * Rnd(S + X) = S + Xo, where S is selected as S = 2^48 x 1.5 + * S + X = S + floor(X) + 0.x1x2x3x4 + * Rnd(S + X) = Rnd(2^48 x 1.5 + X) + * (Note: 2^exp x 1.b1b2b3 ... 
b52, 2^{exp-52} = 2^-4 for exp=48) + * + * exp2(x) = 2^K x 2^fo x (1 + poly(r)), where 2^r = 1 + poly(r) + * + * Scale back: + * dest = src1 x 2^floor(src2) + * + * + */ + +/* Offsets for data table __svml_dexp2_data_internal_avx512 + */ +#define Frac_PowerD0 0 +#define poly_coeff1 128 +#define poly_coeff2 192 +#define poly_coeff3 256 +#define poly_coeff4 320 +#define poly_coeff5 384 +#define poly_coeff6 448 +#define add_const 512 +#define AbsMask 576 +#define Threshold 640 +#define _lIndexMask 704 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_exp2_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups poly_coeff5+__svml_dexp2_data_internal_avx512(%rip), %zmm14 + vmovups poly_coeff6+__svml_dexp2_data_internal_avx512(%rip), %zmm6 + +/* + * Reduced argument + * where VREDUCE is available + */ + vreducepd $65, {sae}, %zmm0, %zmm10 + vmovups poly_coeff4+__svml_dexp2_data_internal_avx512(%rip), %zmm7 + vmovups add_const+__svml_dexp2_data_internal_avx512(%rip), %zmm3 + vmovups poly_coeff3+__svml_dexp2_data_internal_avx512(%rip), %zmm8 + vmovups __svml_dexp2_data_internal_avx512(%rip), %zmm13 + +/* c6*r + c5 */ + vfmadd231pd {rn-sae}, %zmm10, %zmm6, %zmm14 + vmovups poly_coeff2+__svml_dexp2_data_internal_avx512(%rip), %zmm9 + vmovups Threshold+__svml_dexp2_data_internal_avx512(%rip), %zmm2 + +/* + * + * HA + * Variables and constants + * Load constants and vector(s) + */ + vmovups poly_coeff1+__svml_dexp2_data_internal_avx512(%rip), %zmm11 + +/* c6*r^2 + c5*r + c4 */ + vfmadd213pd {rn-sae}, %zmm7, %zmm10, %zmm14 + +/* + * Integer form of K+0.b1b2b3b4 in lower bits - call K_plus_f0 + * Mantissa of normalized double precision FP: 1.b1b2...b52 + */ + vaddpd {rd-sae}, %zmm3, %zmm0, %zmm4 + vandpd AbsMask+__svml_dexp2_data_internal_avx512(%rip), %zmm0, %zmm1 + +/* c6*r^3 + c5*r^2 + c4*r + c3 */ + vfmadd213pd {rn-sae}, %zmm8, %zmm10, %zmm14 + vcmppd 
$29, {sae}, %zmm2, %zmm1, %k0 + +/* c6*r^4 + c5*r^3 + c4*r^2 + c3*r + c2 */ + vfmadd213pd {rn-sae}, %zmm9, %zmm10, %zmm14 + kmovw %k0, %edx + +/* c6*r^5 + c5*r^4 + c4*r^3 + c3*r^2 + c2*r + c1 */ + vfmadd213pd {rn-sae}, %zmm11, %zmm10, %zmm14 + +/* Table value: 2^(0.b1b2b3b4) */ + vpandq _lIndexMask+__svml_dexp2_data_internal_avx512(%rip), %zmm4, %zmm5 + vpermt2pd Frac_PowerD0+64+__svml_dexp2_data_internal_avx512(%rip), %zmm5, %zmm13 + +/* T*r */ + vmulpd {rn-sae}, %zmm10, %zmm13, %zmm12 + +/* T + (T*r*(c6*r^5 + c5*r^4 + c4*r^3 + c3*r^2 + c2*r + c1)) */ + vfmadd213pd {rn-sae}, %zmm13, %zmm12, %zmm14 + +/* Scaling placed at the end to avoid accuracy loss when T*r*scale underflows */ + vscalefpd {rn-sae}, %zmm0, %zmm14, %zmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm1, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm1, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; 
DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call exp2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_exp2_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dexp2_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 
Frac_PowerD0[16][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 add_const[8][2]; + __declspec(align(64)) VUINT32 AbsMask[8][2]; + __declspec(align(64)) VUINT32 Threshold[8][2]; + __declspec(align(64)) VUINT32 _lIndexMask[8][2]; +} __svml_dexp2_data_internal_avx512; +#endif +__svml_dexp2_data_internal_avx512: + /*== Frac_PowerD0 ==*/ + .quad 0x3FF0000000000000 + .quad 0x3FF0B5586CF9890F + .quad 0x3FF172B83C7D517B + .quad 0x3FF2387A6E756238 + .quad 0x3FF306FE0A31B715 + .quad 0x3FF3DEA64C123422 + .quad 0x3FF4BFDAD5362A27 + .quad 0x3FF5AB07DD485429 + .quad 0x3FF6A09E667F3BCD + .quad 0x3FF7A11473EB0187 + .quad 0x3FF8ACE5422AA0DB + .quad 0x3FF9C49182A3F090 + .quad 0x3FFAE89F995AD3AD + .quad 0x3FFC199BDD85529C + .quad 0x3FFD5818DCFBA487 + .quad 0x3FFEA4AFA2A490DA + .align 64 + .quad 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B, 0x3FE62E42FEFA398B /*== poly_coeff1 ==*/ + .align 64 + .quad 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A, 0x3FCEBFBDFF84555A /*== poly_coeff2 ==*/ + .align 64 + .quad 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9, 0x3FAC6B08D4AD86B9 /*== poly_coeff3 ==*/ + .align 64 + .quad 0x3F83B2AD1B172252, 0x3F83B2AD1B172252, 0x3F83B2AD1B172252, 0x3F83B2AD1B172252, 0x3F83B2AD1B172252, 0x3F83B2AD1B172252, 0x3F83B2AD1B172252, 0x3F83B2AD1B172252 /*== poly_coeff4 ==*/ + .align 64 + .quad 0x3F55D7472713CD19, 0x3F55D7472713CD19, 0x3F55D7472713CD19, 0x3F55D7472713CD19, 0x3F55D7472713CD19, 
0x3F55D7472713CD19, 0x3F55D7472713CD19, 0x3F55D7472713CD19 /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3F24A1D7F526371B, 0x3F24A1D7F526371B, 0x3F24A1D7F526371B, 0x3F24A1D7F526371B, 0x3F24A1D7F526371B, 0x3F24A1D7F526371B, 0x3F24A1D7F526371B, 0x3F24A1D7F526371B /*== poly_coeff6 ==*/ + .align 64 + .quad 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000 /* add_const */ + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* AbsMask */ + .align 64 + .quad 0x408fefff00000000, 0x408fefff00000000, 0x408fefff00000000, 0x408fefff00000000, 0x408fefff00000000, 0x408fefff00000000, 0x408fefff00000000, 0x408fefff00000000 /* Threshold */ + .align 64 + .quad 0x000000000000000F, 0x000000000000000F, 0x000000000000000F, 0x000000000000000F, 0x000000000000000F, 0x000000000000000F, 0x000000000000000F, 0x000000000000000F /* _lIndexMask */ + .align 64 + .type __svml_dexp2_data_internal_avx512,@object + .size __svml_dexp2_data_internal_avx512,.-__svml_dexp2_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core-avx2.S new file mode 100644 index 0000000000..4daa687852 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized exp2f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN16v_exp2f _ZGVeN16v_exp2f_avx2_wrapper +#include "../svml_s_exp2f16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core.c new file mode 100644 index 0000000000..e90d9d8684 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized exp2f, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#define SYMBOL_NAME _ZGVeN16v_exp2f +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_exp2f, __GI__ZGVeN16v_exp2f, + __redirect__ZGVeN16v_exp2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core_avx512.S new file mode 100644 index 0000000000..6b512159bc --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f16_core_avx512.S @@ -0,0 +1,271 @@ +/* Function exp2f vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Single precision mantissa represented as: 1.b1b2b3 ... 
b23 + * Constant for single precision: S = 2^19 x 1.5 + * + * 2^X = 2^Xo x 2^{X-Xo} + * 2^X = 2^K x 2^fo x 2^{X-Xo} + * 2^X = 2^K x 2^fo x 2^r + * + * 2^K --> Manual scaling + * 2^fo --> Table lookup + * r --> 1 + poly (r = X - Xo) + * + * Xo = K + fo + * Xo = K + 0.x1x2x3x4 + * + * r = X - Xo + * = Vreduce(X, imm) + * = X - VRndScale(X, imm), where Xo = VRndScale(X, imm) + * + * Rnd(S + X) = S + Xo, where S is selected as S = 2^19 x 1.5 + * S + X = S + floor(X) + 0.x1x2x3x4 + * Rnd(S + X) = Rnd(2^19 x 1.5 + X) + * (Note: 2^exp x 1.b1b2b3 ... b23, 2^{exp-23} = 2^-4 for exp=19) + * + * exp2(x) = 2^K x 2^fo x (1 + poly(r)), where 2^r = 1 + poly(r) + * + * Scale back: + * dest = src1 x 2^floor(src2) + * + * + */ + +/* Offsets for data table __svml_sexp2_data_internal_avx512 + */ +#define Frac_PowerS0 0 +#define poly_coeff1 64 +#define poly_coeff2 128 +#define poly_coeff3 192 +#define add_const 256 +#define AbsMask 320 +#define Threshold 384 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_exp2f_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups add_const+__svml_sexp2_data_internal_avx512(%rip), %zmm3 + +/* + * Reduced argument + * where VREDUCE is available + */ + vreduceps $65, {sae}, %zmm0, %zmm6 + vmovups poly_coeff3+__svml_sexp2_data_internal_avx512(%rip), %zmm5 + vmovups poly_coeff2+__svml_sexp2_data_internal_avx512(%rip), %zmm10 + vmovups Threshold+__svml_sexp2_data_internal_avx512(%rip), %zmm2 + +/* + * + * HA + * Variables and constants + * Load constants and vector(s) + */ + vmovups poly_coeff1+__svml_sexp2_data_internal_avx512(%rip), %zmm7 + +/* + * Integer form of K+0.b1b2b3b4 in lower bits - call K_plus_f0 + * Mantissa of normalized single precision FP: 1.b1b2...b23 + */ + vaddps {rd-sae}, %zmm3, %zmm0, %zmm4 + vandps AbsMask+__svml_sexp2_data_internal_avx512(%rip), %zmm0, %zmm1 + +/* c3*r + c2 */ + vfmadd231ps {rn-sae}, %zmm6, 
%zmm5, %zmm10 + vcmpps $30, {sae}, %zmm2, %zmm1, %k0 + +/* c3*r^2 + c2*r + c1 */ + vfmadd213ps {rn-sae}, %zmm7, %zmm6, %zmm10 + +/* Table value: 2^(0.b1b2b3b4) */ + vpermps __svml_sexp2_data_internal_avx512(%rip), %zmm4, %zmm9 + kmovw %k0, %edx + +/* T*r */ + vmulps {rn-sae}, %zmm6, %zmm9, %zmm8 + +/* T + (T*r*(c3*r^2 + c2*r + c1)) */ + vfmadd213ps {rn-sae}, %zmm9, %zmm8, %zmm10 + +/* Scaling placed at the end to avoid accuracy loss when T*r*scale underflows */ + vscalefps {rn-sae}, %zmm0, %zmm10, %zmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm1, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm1, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + 
+L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call exp2f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_exp2f_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_sexp2_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Frac_PowerS0[16][1]; + __declspec(align(64)) VUINT32 poly_coeff1[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 poly_coeff3[16][1]; + __declspec(align(64)) VUINT32 add_const[16][1]; 
+ __declspec(align(64)) VUINT32 AbsMask[16][1]; + __declspec(align(64)) VUINT32 Threshold[16][1]; +} __svml_sexp2_data_internal_avx512; +#endif +__svml_sexp2_data_internal_avx512: + /*== Frac_PowerS0 ==*/ + .long 0x3F800000 + .long 0x3F85AAC3 + .long 0x3F8B95C2 + .long 0x3F91C3D3 + .long 0x3F9837F0 + .long 0x3F9EF532 + .long 0x3FA5FED7 + .long 0x3FAD583F + .long 0x3FB504F3 + .long 0x3FBD08A4 + .long 0x3FC5672A + .long 0x3FCE248C + .long 0x3FD744FD + .long 0x3FE0CCDF + .long 0x3FEAC0C7 + .long 0x3FF5257D + .align 64 + .long 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222, 0x3F317222 /*== poly_coeff1 ==*/ + .align 64 + .long 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B, 0x3E75F16B /*== poly_coeff2 ==*/ + .align 64 + .long 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA, 0x3D6854CA /*== poly_coeff3 ==*/ + .align 64 + .long 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000, 0x49400000 /* add_const */ + .align 64 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* AbsMask */ + .align 64 + .long 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000 /* Threshold=126.0 */ + .align 64 + .type __svml_sexp2_data_internal_avx512,@object + .size 
__svml_sexp2_data_internal_avx512,.-__svml_sexp2_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core-sse2.S new file mode 100644 index 0000000000..0b3fec834c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized exp2f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_exp2f _ZGVbN4v_exp2f_sse2 +#include "../svml_s_exp2f4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core.c new file mode 100644 index 0000000000..db47118d97 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized exp2f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_exp2f +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_exp2f, __GI__ZGVbN4v_exp2f, + __redirect__ZGVbN4v_exp2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core_sse4.S new file mode 100644 index 0000000000..0d9f45d5c3 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f4_core_sse4.S @@ -0,0 +1,238 @@ +/* Function exp2f vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * exp2(x) = 2^n * T[j] * (1 + P(y)) + * where + * x = m*(1/K) + y, y in [-1/K..1/K] + * m = n*K + j, m,n,j - signed integer, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp2(x)-1 + * on small interval [-1/K..1/K] + * + * Special cases: + * + * exp2(NaN) = NaN + * exp2(+INF) = +INF + * exp2(-INF) = 0 + * exp2(x) = 1 for subnormals + * For IEEE float + * if x >= 128.0 then exp2f(x) overflow + * if x < -151.0 then exp2f(x) underflow + * + */ + +/* Offsets for data table __svml_sexp2_data_internal + */ +#define _sShifter 0 +#define _sPC0 16 +#define _sPC1 32 +#define _sPC2 48 +#define _sPC3 64 +#define _sPC4 80 +#define _sPC5 96 +#define _sPC6 112 +#define _iAbsMask 128 +#define _iDomainRange 144 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_exp2f_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* Check for overflow/underflow */ + movups __svml_sexp2_data_internal(%rip), %xmm1 + +/* Implementation */ + movaps %xmm1, %xmm5 + +/* Polynomial */ + movups _sPC6+__svml_sexp2_data_internal(%rip), %xmm4 + addps %xmm0, %xmm5 + movaps %xmm5, %xmm3 + +/* 2^N */ + pslld $23, %xmm5 + +/* Check for overflow/underflow */ + movdqu _iAbsMask+__svml_sexp2_data_internal(%rip), %xmm2 + subps %xmm1, %xmm3 + +/* R */ + movaps %xmm0, %xmm1 + pand %xmm0, %xmm2 + pcmpgtd _iDomainRange+__svml_sexp2_data_internal(%rip), %xmm2 + subps %xmm3, %xmm1 + movmskps %xmm2, %edx + mulps %xmm1, %xmm4 + addps _sPC5+__svml_sexp2_data_internal(%rip), %xmm4 + mulps %xmm1, %xmm4 + addps _sPC4+__svml_sexp2_data_internal(%rip), %xmm4 + mulps %xmm1, %xmm4 + addps _sPC3+__svml_sexp2_data_internal(%rip), %xmm4 + mulps %xmm1, %xmm4 + addps _sPC2+__svml_sexp2_data_internal(%rip), %xmm4 + mulps %xmm1, %xmm4 + addps _sPC1+__svml_sexp2_data_internal(%rip), %xmm4 + mulps %xmm4, %xmm1 + addps _sPC0+__svml_sexp2_data_internal(%rip), %xmm1 + +/* Reconstruction */ + paddd %xmm5, %xmm1 +
testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm1, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm1 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call exp2f@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_exp2f_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_sexp2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _sShifter[4][1]; + __declspec(align(16)) VUINT32 _sPC0[4][1]; + __declspec(align(16)) VUINT32 _sPC1[4][1]; + __declspec(align(16)) VUINT32 _sPC2[4][1]; +
__declspec(align(16)) VUINT32 _sPC3[4][1]; + __declspec(align(16)) VUINT32 _sPC4[4][1]; + __declspec(align(16)) VUINT32 _sPC5[4][1]; + __declspec(align(16)) VUINT32 _sPC6[4][1]; + __declspec(align(16)) VUINT32 _iAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; +} __svml_sexp2_data_internal; +#endif +__svml_sexp2_data_internal: + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 16 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC0 */ + .align 16 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 /* _sPC1 */ + .align 16 + .long 0x3e75fdef, 0x3e75fdef, 0x3e75fdef, 0x3e75fdef /* _sPC2 */ + .align 16 + .long 0x3d6357cf, 0x3d6357cf, 0x3d6357cf, 0x3d6357cf /* _sPC3 */ + .align 16 + .long 0x3c1d962c, 0x3c1d962c, 0x3c1d962c, 0x3c1d962c /* _sPC4 */ + .align 16 + .long 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51 /* _sPC5 */ + .align 16 + .long 0x39213c8c, 0x39213c8c, 0x39213c8c, 0x39213c8c /* _sPC6 */ + //common + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 16 + .long 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000 /* _iDomainRange=126.0 */ + .align 16 + .type __svml_sexp2_data_internal,@object + .size __svml_sexp2_data_internal,.-__svml_sexp2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core-sse.S new file mode 100644 index 0000000000..4da2278ed8 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized exp2f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_exp2f _ZGVdN8v_exp2f_sse_wrapper +#include "../svml_s_exp2f8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core.c new file mode 100644 index 0000000000..dc34671263 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized exp2f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN8v_exp2f +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_exp2f, __GI__ZGVdN8v_exp2f, + __redirect__ZGVdN8v_exp2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core_avx2.S new file mode 100644 index 0000000000..aa7af4be79 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp2f8_core_avx2.S @@ -0,0 +1,245 @@ +/* Function exp2f vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * exp2(x) = 2^n * T[j] * (1 + P(y)) + * where + * x = m*(1/K) + y, y in [-1/K..1/K] + * m = n*K + j, m,n,j - signed integer, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp2(x)-1 + * on small interval [-1/K..1/K] + * + * Special cases: + * + * exp2(NaN) = NaN + * exp2(+INF) = +INF + * exp2(-INF) = 0 + * exp2(x) = 1 for subnormals + * For IEEE float + * if x >= 128.0 then exp2f(x) overflow + * if x < -151.0 then exp2f(x) underflow + * + */ + +/* Offsets for data table __svml_sexp2_data_internal + */ +#define _sShifter 0 +#define _sPC0 32 +#define _sPC1 64 +#define _sPC2 96 +#define _sPC3 128 +#define _sPC4 160 +#define _sPC5 192 +#define _sPC6 224 +#define _iAbsMask 256 +#define _iDomainRange 288 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_exp2f_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovups __svml_sexp2_data_internal(%rip), %ymm1 + +/* Check for overflow/underflow */ + vmovups _sPC6+__svml_sexp2_data_internal(%rip), %ymm7 + +/* Implementation */ + vaddps %ymm1, %ymm0, %ymm6 + vsubps %ymm1, %ymm6, %ymm4 + +/* 2^N */ + vpslld $23, %ymm6, %ymm8 + +/* R */ + vsubps %ymm4, %ymm0, %ymm5 + +/* Polynomial */ + vfmadd213ps _sPC5+__svml_sexp2_data_internal(%rip), %ymm5, %ymm7 + vfmadd213ps _sPC4+__svml_sexp2_data_internal(%rip), %ymm5, %ymm7 + vfmadd213ps _sPC3+__svml_sexp2_data_internal(%rip), %ymm5, %ymm7 + vfmadd213ps _sPC2+__svml_sexp2_data_internal(%rip), %ymm5, %ymm7 + vfmadd213ps _sPC1+__svml_sexp2_data_internal(%rip), %ymm5, %ymm7 + vfmadd213ps _sPC0+__svml_sexp2_data_internal(%rip), %ymm5, %ymm7 + +/* Check for overflow/underflow */ + vandps _iAbsMask+__svml_sexp2_data_internal(%rip), %ymm0, %ymm2 + vpcmpgtd _iDomainRange+__svml_sexp2_data_internal(%rip), %ymm2, %ymm3 + vmovmskps %ymm3, %edx + +/* Reconstruction */ + vpaddd %ymm8,
%ymm7, %ymm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %ymm1, %ymm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm0, 32(%rsp) + vmovups %ymm1, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: 
r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call exp2f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_exp2f_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_sexp2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _sShifter[8][1]; + __declspec(align(32)) VUINT32 _sPC0[8][1]; + __declspec(align(32)) VUINT32 _sPC1[8][1]; + __declspec(align(32)) VUINT32 _sPC2[8][1]; + __declspec(align(32)) VUINT32 _sPC3[8][1]; + __declspec(align(32)) VUINT32 _sPC4[8][1]; + __declspec(align(32)) VUINT32 _sPC5[8][1]; + __declspec(align(32)) VUINT32 _sPC6[8][1]; + __declspec(align(32)) VUINT32 _iAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; +} __svml_sexp2_data_internal; +#endif +__svml_sexp2_data_internal: + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 32 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC0 */ +
.align 32 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 /* _sPC1 */ + .align 32 + .long 0x3e75fdef, 0x3e75fdef, 0x3e75fdef, 0x3e75fdef, 0x3e75fdef, 0x3e75fdef, 0x3e75fdef, 0x3e75fdef /* _sPC2 */ + .align 32 + .long 0x3d6357cf, 0x3d6357cf, 0x3d6357cf, 0x3d6357cf, 0x3d6357cf, 0x3d6357cf, 0x3d6357cf, 0x3d6357cf /* _sPC3 */ + .align 32 + .long 0x3c1d962c, 0x3c1d962c, 0x3c1d962c, 0x3c1d962c, 0x3c1d962c, 0x3c1d962c, 0x3c1d962c, 0x3c1d962c /* _sPC4 */ + .align 32 + .long 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51, 0x3aaf7a51 /* _sPC5 */ + .align 32 + .long 0x39213c8c, 0x39213c8c, 0x39213c8c, 0x39213c8c, 0x39213c8c, 0x39213c8c, 0x39213c8c, 0x39213c8c /* _sPC6 */ + //common + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 32 + .long 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000, 0x42fc0000 /* _iDomainRange=126.0 */ + .align 32 + .type __svml_sexp2_data_internal,@object + .size __svml_sexp2_data_internal,.-__svml_sexp2_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_exp22_core.S b/sysdeps/x86_64/fpu/svml_d_exp22_core.S new file mode 100644 index 0000000000..f03080a977 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp22_core.S @@ -0,0 +1,29 @@ +/* Function exp2 vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_exp2) +WRAPPER_IMPL_SSE2 exp2 +END (_ZGVbN2v_exp2) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_exp2) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_exp24_core.S b/sysdeps/x86_64/fpu/svml_d_exp24_core.S new file mode 100644 index 0000000000..40475c7a94 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp24_core.S @@ -0,0 +1,29 @@ +/* Function exp2 vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_exp2) +WRAPPER_IMPL_AVX _ZGVbN2v_exp2 +END (_ZGVdN4v_exp2) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_exp2) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_exp24_core_avx.S b/sysdeps/x86_64/fpu/svml_d_exp24_core_avx.S new file mode 100644 index 0000000000..a7d22409df --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp24_core_avx.S @@ -0,0 +1,25 @@ +/* Function exp2 vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_exp2) +WRAPPER_IMPL_AVX _ZGVbN2v_exp2 +END (_ZGVcN4v_exp2) diff --git a/sysdeps/x86_64/fpu/svml_d_exp28_core.S b/sysdeps/x86_64/fpu/svml_d_exp28_core.S new file mode 100644 index 0000000000..f68aaed427 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp28_core.S @@ -0,0 +1,25 @@ +/* Function exp2 vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_exp2) +WRAPPER_IMPL_AVX512 _ZGVdN4v_exp2 +END (_ZGVeN8v_exp2) diff --git a/sysdeps/x86_64/fpu/svml_s_exp2f16_core.S b/sysdeps/x86_64/fpu/svml_s_exp2f16_core.S new file mode 100644 index 0000000000..8ba4e82272 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp2f16_core.S @@ -0,0 +1,25 @@ +/* Function exp2f vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_exp2f) +WRAPPER_IMPL_AVX512 _ZGVdN8v_exp2f +END (_ZGVeN16v_exp2f) diff --git a/sysdeps/x86_64/fpu/svml_s_exp2f4_core.S b/sysdeps/x86_64/fpu/svml_s_exp2f4_core.S new file mode 100644 index 0000000000..916f176dca --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp2f4_core.S @@ -0,0 +1,29 @@ +/* Function exp2f vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_exp2f) +WRAPPER_IMPL_SSE2 exp2f +END (_ZGVbN4v_exp2f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_exp2f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_exp2f8_core.S b/sysdeps/x86_64/fpu/svml_s_exp2f8_core.S new file mode 100644 index 0000000000..b8821b952b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp2f8_core.S @@ -0,0 +1,29 @@ +/* Function exp2f vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_exp2f) +WRAPPER_IMPL_AVX _ZGVbN4v_exp2f +END (_ZGVdN8v_exp2f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_exp2f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_exp2f8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_exp2f8_core_avx.S new file mode 100644 index 0000000000..ddaaf3b59a --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp2f8_core_avx.S @@ -0,0 +1,25 @@ +/* Function exp2f vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_exp2f) +WRAPPER_IMPL_AVX _ZGVbN4v_exp2f +END (_ZGVcN8v_exp2f) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx.c new file mode 100644 index 0000000000..341ec99724 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-exp2.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx2.c new file mode 100644 index 0000000000..341ec99724 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-exp2.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx512f.c new file mode 100644 index 0000000000..341ec99724 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp2-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-exp2.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp2.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp2.c new file mode 100644 index 0000000000..b3b04f63e4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp2.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC exp2 +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 9bc9d1dafa..2f7172bd7b 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVbN2v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVbN2v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVbN2v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVbN2vv_hypot) +VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVbN2v_exp2) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c 
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index c41994d90a..e2d519faac 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVdN4v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVdN4v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVdN4v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVdN4vv_hypot) +VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVdN4v_exp2) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 881f6c801a..1ce4d8b413 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVcN4v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVcN4v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVcN4v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVcN4vv_hypot) +VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVcN4v_exp2) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 6fd106fe68..6c87cec648 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acos), _ZGVeN8v_acos) VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVeN8v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVeN8v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVeN8vv_hypot) +VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVeN8v_exp2) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx.c new file mode 100644 index 0000000000..0281d386fb --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-exp2f.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx2.c new file mode 100644 index 0000000000..0281d386fb --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-exp2f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx512f.c new file mode 100644 index 0000000000..0281d386fb --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-exp2f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp2f.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f.c new file mode 100644 index 0000000000..bf57661bee --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp2f.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC exp2f +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 4c2ea6ddfe..597d7d7598 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVeN16v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVeN16v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVeN16v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVeN16vv_hypotf) +VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVeN16v_exp2f) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 1d5d952d07..3500eec810 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVbN4v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVbN4v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVbN4v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVbN4vv_hypotf) 
+VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVbN4v_exp2f) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 7a750f3781..921b9c65d6 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVdN8v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVdN8v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVdN8v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVdN8vv_hypotf) +VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVdN8v_exp2f) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index af816a7789..6cbcb57521 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -31,6 +31,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (acosf), _ZGVcN8v_acosf) VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVcN8v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVcN8v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVcN8vv_hypotf) +VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVcN8v_exp2f) #define VEC_INT_TYPE __m128i
From patchwork Tue Dec 28 20:11:17 2021 X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573815 To: libc-alpha@sourceware.org Subject: [PATCH v4 05/18] x86-64: Add vector exp10/exp10f implementation to libmvec Date: Tue, 28 Dec 2021 12:11:17 -0800 Message-Id: <20211228201130.737370-6-skpgkp2@gmail.com> In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> From: Sunil Pandey Reply-To: Sunil K Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Implement vectorized exp10/exp10f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector exp10/exp10f with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_exp102_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_exp102_core.c | 27 ++ .../fpu/multiarch/svml_d_exp102_core_sse4.S | 418 +++++++++++++++++ .../fpu/multiarch/svml_d_exp104_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_exp104_core.c | 27 ++ .../fpu/multiarch/svml_d_exp104_core_avx2.S | 429 ++++++++++++++++++ .../fpu/multiarch/svml_d_exp108_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_exp108_core.c | 27 ++ .../fpu/multiarch/svml_d_exp108_core_avx512.S | 287 ++++++++++++ .../fpu/multiarch/svml_s_exp10f16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_exp10f16_core.c | 28 ++ .../multiarch/svml_s_exp10f16_core_avx512.S | 269 +++++++++++ .../fpu/multiarch/svml_s_exp10f4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_exp10f4_core.c | 28 ++ .../fpu/multiarch/svml_s_exp10f4_core_sse4.S | 311 +++++++++++++ .../fpu/multiarch/svml_s_exp10f8_core-sse.S | 20 + .../fpu/multiarch/svml_s_exp10f8_core.c | 28 ++ .../fpu/multiarch/svml_s_exp10f8_core_avx2.S | 331 ++++++++++++++ sysdeps/x86_64/fpu/svml_d_exp102_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_exp104_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_exp104_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_exp108_core.S | 25 + sysdeps/x86_64/fpu/svml_s_exp10f16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_exp10f4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_exp10f8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_exp10f8_core_avx.S | 25 + .../fpu/test-double-libmvec-exp10-avx.c | 1 + .../fpu/test-double-libmvec-exp10-avx2.c | 1 + .../fpu/test-double-libmvec-exp10-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-exp10.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-exp10f-avx.c | 1 + .../fpu/test-float-libmvec-exp10f-avx2.c | 1 + .../fpu/test-float-libmvec-exp10f-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-exp10f.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2617 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_exp102_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_exp104_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_exp104_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_exp108_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp10f16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp10f4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp10f8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_exp10f8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-exp10.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-exp10f.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 36d6643eb9..bc18621f17 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -153,4 +153,15 @@ #define __DECL_SIMD_exp2f32x #define __DECL_SIMD_exp2f64x #define __DECL_SIMD_exp2f128x + +#define __DECL_SIMD_exp10 +#define __DECL_SIMD_exp10f +#define __DECL_SIMD_exp10l +#define __DECL_SIMD_exp10f16 +#define __DECL_SIMD_exp10f32 +#define __DECL_SIMD_exp10f64 +#define __DECL_SIMD_exp10f128 +#define __DECL_SIMD_exp10f32x +#define __DECL_SIMD_exp10f64x +#define __DECL_SIMD_exp10f128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 645088cbf3..870778457f 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -111,7 +111,7 @@ __MATHCALL (modf,, (_Mdouble_ __x, _Mdouble_ *__iptr)) __nonnull ((2)); #if __GLIBC_USE (IEC_60559_FUNCS_EXT_C2X) /* Compute exponent to base ten. 
*/ -__MATHCALL (exp10,, (_Mdouble_ __x)); +__MATHCALL_VEC (exp10,, (_Mdouble_ __x)); #endif #if defined __USE_XOPEN_EXTENDED || defined __USE_ISOC99 diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 1717f2dee9..b3c1f59593 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,40 +49,48 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F +GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F +GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F +GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F +GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F +GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F +GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F +GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F +GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index c7a972521b..f3f9c2e092 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ 
b/sysdeps/x86/fpu/bits/math-vector.h @@ -78,6 +78,10 @@ # define __DECL_SIMD_exp2 __DECL_SIMD_x86_64 # undef __DECL_SIMD_exp2f # define __DECL_SIMD_exp2f __DECL_SIMD_x86_64 +# undef __DECL_SIMD_exp10 +# define __DECL_SIMD_exp10 __DECL_SIMD_x86_64 +# undef __DECL_SIMD_exp10f +# define __DECL_SIMD_exp10f __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 0994e6dfac..c033abbedc 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -38,6 +38,8 @@ !GCC$ builtin (hypotf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (exp2) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (exp2f) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (exp10) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (exp10f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -61,3 +63,5 @@ !GCC$ builtin (hypotf) attributes simd (notinbranch) if('x32') !GCC$ builtin (exp2) attributes simd (notinbranch) if('x32') !GCC$ builtin (exp2f) attributes simd (notinbranch) if('x32') +!GCC$ builtin (exp10) attributes simd (notinbranch) if('x32') +!GCC$ builtin (exp10f) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 03b2364417..fd0a9da439 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -27,6 +27,7 @@ libmvec-funcs = \ atan \ cos \ exp \ + exp10 \ exp2 \ hypot \ log \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 12b7ad1830..f29cfa4cbf 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,11 +17,13 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; 
_ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; + _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; + _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index bc4479ad39..45f2e4bb53 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1252,6 +1252,26 @@ float: 1 float128: 3 ldouble: 2 +Function: "exp10_vlen16": +float: 3 + +Function: "exp10_vlen2": +double: 1 + +Function: "exp10_vlen4": +double: 1 +float: 1 + +Function: "exp10_vlen4_avx2": +double: 1 + +Function: "exp10_vlen8": +double: 1 +float: 1 + +Function: "exp10_vlen8_avx2": +float: 1 + Function: "exp2": double: 1 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core-sse2.S new file mode 100644 index 0000000000..ab615c0323 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized exp10, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_exp10 _ZGVbN2v_exp10_sse2 +#include "../svml_d_exp102_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core.c new file mode 100644 index 0000000000..5c5625b278 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized exp10, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVbN2v_exp10 +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_exp10, __GI__ZGVbN2v_exp10, __redirect__ZGVbN2v_exp10) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core_sse4.S new file mode 100644 index 0000000000..7c6e5de3e0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp102_core_sse4.S @@ -0,0 +1,418 @@ +/* Function exp10 vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * exp10(x) = 2^(x/log10(2)) = 2^n * (1 + T[j]) * (1 + P(y)) + * where + * x = m*log10(2)/K + y, y in [-log10(2)/K..log10(2)/K] + * m = n*K + j, m,n,j - signed integers, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp10(x)-1 + * on small interval [-log10(2)/K..log10(2)/K] + * + * Special cases: + * + * exp10(NaN) = NaN + * exp10(+INF) = +INF + * exp10(-INF) = 0 + * exp10(x) = 1 for subnormals + * For IEEE double + * if x > 3.39782712893383973096e+02 then exp10(x) overflows + * if x < -3.45133219101941108420e+02 then exp10(x) underflows + * + */ + +/* Offsets for data table __svml_dexp10_data_internal + */ +#define _dbT 0 +#define _dbLg2_10 1024 +#define _dbShifter 1040 +#define _dbInvLg2_10hi 1056 +#define _dbInvLg2_10lo 1072 +#define _dPC1 1088 +#define _dPC2 1104 +#define _dPC3 1120 +#define _dPC4 1136 +#define _dPC5 1152 +#define _lExpMask 1168 +#define _iIndexMask 1184 +#define _iAbsMask 1200 +#define _iDomainRange 1216 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_exp10_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* R */ + movaps %xmm0, %xmm12 + +/* Load argument */ + movups _dbLg2_10+__svml_dexp10_data_internal(%rip), %xmm13 + lea __svml_dexp10_data_internal(%rip), %rsi + mulpd %xmm0, %xmm13 + movups _dbShifter+__svml_dexp10_data_internal(%rip), %xmm1 + addpd %xmm1, %xmm13 + movaps %xmm13, %xmm9 + subpd %xmm1, %xmm9 + movups _dbInvLg2_10hi+__svml_dexp10_data_internal(%rip), %xmm8 + mulpd %xmm9, %xmm8 + movups _dbInvLg2_10lo+__svml_dexp10_data_internal(%rip), %xmm10 + mulpd %xmm9, %xmm10 + subpd %xmm8, %xmm12 + subpd %xmm10, %xmm12 + +/* + * Polynomial + * poly(dN) = a1*dR+...+a5*dR^5 + */ + movups _dPC5+__svml_dexp10_data_internal(%rip), %xmm11 + mulpd %xmm12, %xmm11 + addpd _dPC4+__svml_dexp10_data_internal(%rip), %xmm11 + mulpd %xmm12, %xmm11 + addpd _dPC3+__svml_dexp10_data_internal(%rip), %xmm11 + mulpd %xmm12, %xmm11 + 
addpd _dPC2+__svml_dexp10_data_internal(%rip), %xmm11 + +/* a1+...+a5*dR^4 ! */ + mulpd %xmm12, %xmm11 + addpd _dPC1+__svml_dexp10_data_internal(%rip), %xmm11 + movq _iIndexMask+__svml_dexp10_data_internal(%rip), %xmm5 + +/* Index and lookup */ + pshufd $136, %xmm13, %xmm6 + +/* 2^N */ + psllq $45, %xmm13 + pand %xmm5, %xmm6 + +/* iIndex*=sizeof(D); */ + pslld $3, %xmm6 + movd %xmm6, %eax + pshufd $1, %xmm6, %xmm7 + movq _iAbsMask+__svml_dexp10_data_internal(%rip), %xmm2 + +/* a1*dR+...+a5*dR^5 */ + mulpd %xmm11, %xmm12 + movd %xmm7, %ecx + +/* Check for overflow\underflow */ + pshufd $221, %xmm0, %xmm4 + movq _iDomainRange+__svml_dexp10_data_internal(%rip), %xmm3 + pand %xmm2, %xmm4 + movslq %eax, %rax + pcmpgtd %xmm3, %xmm4 + movslq %ecx, %rcx + movmskps %xmm4, %edx + +/* lM==EXP(2^N) */ + pand _lExpMask+__svml_dexp10_data_internal(%rip), %xmm13 + movsd (%rsi,%rax), %xmm1 + movhpd (%rsi,%rcx), %xmm1 + +/* Tj*poly */ + mulpd %xmm1, %xmm12 + addpd %xmm12, %xmm1 + +/* quick 2^N */ + paddq %xmm13, %xmm1 + andl $3, %edx + +/* Finish */ + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm1, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx xmm1 + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* 
Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm1 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call exp10@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_exp10_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dexp10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _dbT[(1<<7)][2]; + __declspec(align(16)) VUINT32 _dbLg2_10[2][2]; + __declspec(align(16)) VUINT32 _dbShifter[2][2]; + __declspec(align(16)) VUINT32 _dbInvLg2_10hi[2][2]; + __declspec(align(16)) VUINT32 _dbInvLg2_10lo[2][2]; + __declspec(align(16)) VUINT32 _dPC1[2][2]; + __declspec(align(16)) VUINT32 _dPC2[2][2]; + __declspec(align(16)) VUINT32 _dPC3[2][2]; + __declspec(align(16)) VUINT32 _dPC4[2][2]; + __declspec(align(16)) VUINT32 _dPC5[2][2]; + __declspec(align(16)) VUINT32 _lExpMask[2][2]; + __declspec(align(16)) VUINT32 _iIndexMask[4][1]; + __declspec(align(16)) VUINT32 _iAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; +} __svml_dexp10_data_internal; +#endif +__svml_dexp10_data_internal: + /*== _dbT ==*/ + .quad 0x3ff0000000000000 /*2^( 0 /128)*/ + .quad 0x3ff0163da9fb3335 /*2^( 1 /128)*/ + .quad 0x3ff02c9a3e778061 /*2^( 2 /128)*/ + .quad 0x3ff04315e86e7f85 /*2^( 3 /128)*/ + .quad 0x3ff059b0d3158574 /*2^( 4 /128)*/ + .quad 0x3ff0706b29ddf6de /*2^( 5 /128)*/ + .quad 0x3ff0874518759bc8 /*2^( 6 /128)*/ + .quad 0x3ff09e3ecac6f383 /*2^( 7 /128)*/ + .quad 0x3ff0b5586cf9890f /*2^( 
8 /128)*/ + .quad 0x3ff0cc922b7247f7 /*2^( 9 /128)*/ + .quad 0x3ff0e3ec32d3d1a2 /*2^( 10 /128)*/ + .quad 0x3ff0fb66affed31b /*2^( 11 /128)*/ + .quad 0x3ff11301d0125b51 /*2^( 12 /128)*/ + .quad 0x3ff12abdc06c31cc /*2^( 13 /128)*/ + .quad 0x3ff1429aaea92de0 /*2^( 14 /128)*/ + .quad 0x3ff15a98c8a58e51 /*2^( 15 /128)*/ + .quad 0x3ff172b83c7d517b /*2^( 16 /128)*/ + .quad 0x3ff18af9388c8dea /*2^( 17 /128)*/ + .quad 0x3ff1a35beb6fcb75 /*2^( 18 /128)*/ + .quad 0x3ff1bbe084045cd4 /*2^( 19 /128)*/ + .quad 0x3ff1d4873168b9aa /*2^( 20 /128)*/ + .quad 0x3ff1ed5022fcd91d /*2^( 21 /128)*/ + .quad 0x3ff2063b88628cd6 /*2^( 22 /128)*/ + .quad 0x3ff21f49917ddc96 /*2^( 23 /128)*/ + .quad 0x3ff2387a6e756238 /*2^( 24 /128)*/ + .quad 0x3ff251ce4fb2a63f /*2^( 25 /128)*/ + .quad 0x3ff26b4565e27cdd /*2^( 26 /128)*/ + .quad 0x3ff284dfe1f56381 /*2^( 27 /128)*/ + .quad 0x3ff29e9df51fdee1 /*2^( 28 /128)*/ + .quad 0x3ff2b87fd0dad990 /*2^( 29 /128)*/ + .quad 0x3ff2d285a6e4030b /*2^( 30 /128)*/ + .quad 0x3ff2ecafa93e2f56 /*2^( 31 /128)*/ + .quad 0x3ff306fe0a31b715 /*2^( 32 /128)*/ + .quad 0x3ff32170fc4cd831 /*2^( 33 /128)*/ + .quad 0x3ff33c08b26416ff /*2^( 34 /128)*/ + .quad 0x3ff356c55f929ff1 /*2^( 35 /128)*/ + .quad 0x3ff371a7373aa9cb /*2^( 36 /128)*/ + .quad 0x3ff38cae6d05d866 /*2^( 37 /128)*/ + .quad 0x3ff3a7db34e59ff7 /*2^( 38 /128)*/ + .quad 0x3ff3c32dc313a8e5 /*2^( 39 /128)*/ + .quad 0x3ff3dea64c123422 /*2^( 40 /128)*/ + .quad 0x3ff3fa4504ac801c /*2^( 41 /128)*/ + .quad 0x3ff4160a21f72e2a /*2^( 42 /128)*/ + .quad 0x3ff431f5d950a897 /*2^( 43 /128)*/ + .quad 0x3ff44e086061892d /*2^( 44 /128)*/ + .quad 0x3ff46a41ed1d0057 /*2^( 45 /128)*/ + .quad 0x3ff486a2b5c13cd0 /*2^( 46 /128)*/ + .quad 0x3ff4a32af0d7d3de /*2^( 47 /128)*/ + .quad 0x3ff4bfdad5362a27 /*2^( 48 /128)*/ + .quad 0x3ff4dcb299fddd0d /*2^( 49 /128)*/ + .quad 0x3ff4f9b2769d2ca7 /*2^( 50 /128)*/ + .quad 0x3ff516daa2cf6642 /*2^( 51 /128)*/ + .quad 0x3ff5342b569d4f82 /*2^( 52 /128)*/ + .quad 0x3ff551a4ca5d920f /*2^( 53 /128)*/ + .quad 
0x3ff56f4736b527da /*2^( 54 /128)*/ + .quad 0x3ff58d12d497c7fd /*2^( 55 /128)*/ + .quad 0x3ff5ab07dd485429 /*2^( 56 /128)*/ + .quad 0x3ff5c9268a5946b7 /*2^( 57 /128)*/ + .quad 0x3ff5e76f15ad2148 /*2^( 58 /128)*/ + .quad 0x3ff605e1b976dc09 /*2^( 59 /128)*/ + .quad 0x3ff6247eb03a5585 /*2^( 60 /128)*/ + .quad 0x3ff6434634ccc320 /*2^( 61 /128)*/ + .quad 0x3ff6623882552225 /*2^( 62 /128)*/ + .quad 0x3ff68155d44ca973 /*2^( 63 /128)*/ + .quad 0x3ff6a09e667f3bcd /*2^( 64 /128)*/ + .quad 0x3ff6c012750bdabf /*2^( 65 /128)*/ + .quad 0x3ff6dfb23c651a2f /*2^( 66 /128)*/ + .quad 0x3ff6ff7df9519484 /*2^( 67 /128)*/ + .quad 0x3ff71f75e8ec5f74 /*2^( 68 /128)*/ + .quad 0x3ff73f9a48a58174 /*2^( 69 /128)*/ + .quad 0x3ff75feb564267c9 /*2^( 70 /128)*/ + .quad 0x3ff780694fde5d3f /*2^( 71 /128)*/ + .quad 0x3ff7a11473eb0187 /*2^( 72 /128)*/ + .quad 0x3ff7c1ed0130c132 /*2^( 73 /128)*/ + .quad 0x3ff7e2f336cf4e62 /*2^( 74 /128)*/ + .quad 0x3ff80427543e1a12 /*2^( 75 /128)*/ + .quad 0x3ff82589994cce13 /*2^( 76 /128)*/ + .quad 0x3ff8471a4623c7ad /*2^( 77 /128)*/ + .quad 0x3ff868d99b4492ed /*2^( 78 /128)*/ + .quad 0x3ff88ac7d98a6699 /*2^( 79 /128)*/ + .quad 0x3ff8ace5422aa0db /*2^( 80 /128)*/ + .quad 0x3ff8cf3216b5448c /*2^( 81 /128)*/ + .quad 0x3ff8f1ae99157736 /*2^( 82 /128)*/ + .quad 0x3ff9145b0b91ffc6 /*2^( 83 /128)*/ + .quad 0x3ff93737b0cdc5e5 /*2^( 84 /128)*/ + .quad 0x3ff95a44cbc8520f /*2^( 85 /128)*/ + .quad 0x3ff97d829fde4e50 /*2^( 86 /128)*/ + .quad 0x3ff9a0f170ca07ba /*2^( 87 /128)*/ + .quad 0x3ff9c49182a3f090 /*2^( 88 /128)*/ + .quad 0x3ff9e86319e32323 /*2^( 89 /128)*/ + .quad 0x3ffa0c667b5de565 /*2^( 90 /128)*/ + .quad 0x3ffa309bec4a2d33 /*2^( 91 /128)*/ + .quad 0x3ffa5503b23e255d /*2^( 92 /128)*/ + .quad 0x3ffa799e1330b358 /*2^( 93 /128)*/ + .quad 0x3ffa9e6b5579fdbf /*2^( 94 /128)*/ + .quad 0x3ffac36bbfd3f37a /*2^( 95 /128)*/ + .quad 0x3ffae89f995ad3ad /*2^( 96 /128)*/ + .quad 0x3ffb0e07298db666 /*2^( 97 /128)*/ + .quad 0x3ffb33a2b84f15fb /*2^( 98 /128)*/ + .quad 0x3ffb59728de5593a 
/*2^( 99 /128)*/ + .quad 0x3ffb7f76f2fb5e47 /*2^( 100 /128)*/ + .quad 0x3ffba5b030a1064a /*2^( 101 /128)*/ + .quad 0x3ffbcc1e904bc1d2 /*2^( 102 /128)*/ + .quad 0x3ffbf2c25bd71e09 /*2^( 103 /128)*/ + .quad 0x3ffc199bdd85529c /*2^( 104 /128)*/ + .quad 0x3ffc40ab5fffd07a /*2^( 105 /128)*/ + .quad 0x3ffc67f12e57d14b /*2^( 106 /128)*/ + .quad 0x3ffc8f6d9406e7b5 /*2^( 107 /128)*/ + .quad 0x3ffcb720dcef9069 /*2^( 108 /128)*/ + .quad 0x3ffcdf0b555dc3fa /*2^( 109 /128)*/ + .quad 0x3ffd072d4a07897c /*2^( 110 /128)*/ + .quad 0x3ffd2f87080d89f2 /*2^( 111 /128)*/ + .quad 0x3ffd5818dcfba487 /*2^( 112 /128)*/ + .quad 0x3ffd80e316c98398 /*2^( 113 /128)*/ + .quad 0x3ffda9e603db3285 /*2^( 114 /128)*/ + .quad 0x3ffdd321f301b460 /*2^( 115 /128)*/ + .quad 0x3ffdfc97337b9b5f /*2^( 116 /128)*/ + .quad 0x3ffe264614f5a129 /*2^( 117 /128)*/ + .quad 0x3ffe502ee78b3ff6 /*2^( 118 /128)*/ + .quad 0x3ffe7a51fbc74c83 /*2^( 119 /128)*/ + .quad 0x3ffea4afa2a490da /*2^( 120 /128)*/ + .quad 0x3ffecf482d8e67f1 /*2^( 121 /128)*/ + .quad 0x3ffefa1bee615a27 /*2^( 122 /128)*/ + .quad 0x3fff252b376bba97 /*2^( 123 /128)*/ + .quad 0x3fff50765b6e4540 /*2^( 124 /128)*/ + .quad 0x3fff7bfdad9cbe14 /*2^( 125 /128)*/ + .quad 0x3fffa7c1819e90d8 /*2^( 126 /128)*/ + .quad 0x3fffd3c22b8f71f1 /*2^( 127 /128)*/ + .align 16 + .quad 0x407a934f0979a371, 0x407a934f0979a371 /* _dbLg2_10*2^K */ + .align 16 + .quad 0x4338800000000000, 0x4338800000000000 /* _dbShifter */ + .align 16 + .quad 0x3f63441350a00000, 0x3f63441350a00000 /* _dbInvLg2_10hi/2^K 53-11-K bits*/ + .align 16 + .quad 0xbd10c0219dc1da99, 0xbd10c0219dc1da99 /* _dbInvLg2_10lo/2^K */ + //PC0 = 1.0 + .align 16 + .quad 0x40026bb1bbb55516, 0x40026bb1bbb55516 /* _dPC1 */ + .align 16 + .quad 0x40053524c73ce8e3, 0x40053524c73ce8e3 /* _dPC2 */ + .align 16 + .quad 0x4000470591ccea8b, 0x4000470591ccea8b /* _dPC3 */ + .align 16 + .quad 0x3ff2bd767584db59, 0x3ff2bd767584db59 /* _dPC4 */ + .align 16 + .quad 0x3fe144c03efafb54, 0x3fe144c03efafb54 /* _dPC5 */ + .align 16 + 
.quad 0xfff0000000000000, 0xfff0000000000000 /* _lExpMask */ + .align 16 + .long 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f /* _iIndexMask =(2^K-1)*/ + //common + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 16 + .long 0x40733a70, 0x40733a70, 0x40733a70, 0x40733a70 /* _iDomainRange */ + .align 16 + .type __svml_dexp10_data_internal,@object + .size __svml_dexp10_data_internal,.-__svml_dexp10_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core-sse.S new file mode 100644 index 0000000000..260c052143 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized exp10, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN4v_exp10 _ZGVdN4v_exp10_sse_wrapper +#include "../svml_d_exp104_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core.c new file mode 100644 index 0000000000..e3e302be72 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized exp10, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN4v_exp10 +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_exp10, __GI__ZGVdN4v_exp10, __redirect__ZGVdN4v_exp10) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core_avx2.S new file mode 100644 index 0000000000..1a53f43c9e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp104_core_avx2.S @@ -0,0 +1,429 @@ +/* Function exp10 vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ +   You should have received a copy of the GNU Lesser General Public +   License along with the GNU C Library; if not, see +   https://www.gnu.org/licenses/.  */ + +/* + * ALGORITHM DESCRIPTION: + * + *  exp10(x)  = 2^(x/log10(2)) = 2^n * (1 + T[j]) * (1 + P(y)) + *  where + *       x = m*log10(2)/K + y,  y in [-log10(2)/K..log10(2)/K] + *       m = n*K + j,           m,n,j - signed integer, j in [-K/2..K/2] + * + *       values of 2^(j/K) are tabulated + * + *       P(y) is a minimax polynomial approximation of exp10(x)-1 + *       on small interval [-log10(2)/K..log10(2)/K] + * + * Special cases: + * + *  exp10(NaN)  = NaN + *  exp10(+INF) = +INF + *  exp10(-INF) = 0 + *  exp10(x)    = 1 for subnormals + *  For IEEE double + *    if x >  3.39782712893383973096e+02 then exp10(x) overflows + *    if x < -3.45133219101941108420e+02 then exp10(x) underflows + * + */ + +/* Offsets for data table __svml_dexp10_data_internal + */ +#define _dbT 0 +#define _dbLg2_10 1024 +#define _dbShifter 1056 +#define _dbInvLg2_10hi 1088 +#define _dbInvLg2_10lo 1120 +#define _dPC1 1152 +#define _dPC2 1184 +#define _dPC3 1216 +#define _dPC4 1248 +#define _dPC5 1280 +#define _lExpMask 1312 +#define _iIndexMask 1344 +#define _iAbsMask 1376 +#define _iDomainRange 1408 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_exp10_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea __svml_dexp10_data_internal(%rip), %r8 + vmovapd %ymm0, %ymm2 + vmovupd _dbShifter+__svml_dexp10_data_internal(%rip), %ymm3 + +/* Load argument */ + vmovupd _dbLg2_10+__svml_dexp10_data_internal(%rip), %ymm0 + vfmadd213pd %ymm3, %ymm2, %ymm0 + vsubpd %ymm3, %ymm0, %ymm1 + +/* R */ + vmovupd _dbInvLg2_10hi+__svml_dexp10_data_internal(%rip), %ymm3 + vfnmadd213pd %ymm2, %ymm1, %ymm3 + +/* Check for overflow/underflow */ + vextractf128 $1, %ymm2, %xmm4 + vfnmadd132pd _dbInvLg2_10lo+__svml_dexp10_data_internal(%rip), %ymm3, %ymm1 + vshufps $221, %xmm4, %xmm2, %xmm5 + vandps
_iAbsMask+__svml_dexp10_data_internal(%rip), %xmm5, %xmm6 + vpcmpgtd _iDomainRange+__svml_dexp10_data_internal(%rip), %xmm6, %xmm7 + +/* + * Polynomial + * poly(dN) = a1*dR+...+a5*dR^5 + */ + vmovupd _dPC5+__svml_dexp10_data_internal(%rip), %ymm4 + vmovmskps %xmm7, %eax + vfmadd213pd _dPC4+__svml_dexp10_data_internal(%rip), %ymm1, %ymm4 + vfmadd213pd _dPC3+__svml_dexp10_data_internal(%rip), %ymm1, %ymm4 + vfmadd213pd _dPC2+__svml_dexp10_data_internal(%rip), %ymm1, %ymm4 + +/* a1+...+a5*dR^4 ! */ + vfmadd213pd _dPC1+__svml_dexp10_data_internal(%rip), %ymm1, %ymm4 + +/* a1*dR+...+a5*dR^5 */ + vmulpd %ymm4, %ymm1, %ymm1 + +/* Index and lookup */ + vextractf128 $1, %ymm0, %xmm8 + vshufps $136, %xmm8, %xmm0, %xmm9 + vandps _iIndexMask+__svml_dexp10_data_internal(%rip), %xmm9, %xmm10 + +/* iIndex*=sizeof(D); */ + vpslld $3, %xmm10, %xmm13 + vmovd %xmm13, %edx + +/* 2^N */ + vpsllq $45, %ymm0, %ymm0 + vpextrd $2, %xmm13, %esi + movslq %edx, %rdx + vpextrd $1, %xmm13, %ecx + movslq %esi, %rsi + vpextrd $3, %xmm13, %edi + movslq %ecx, %rcx + movslq %edi, %rdi + vmovsd (%r8,%rdx), %xmm11 + vmovsd (%r8,%rsi), %xmm14 + vmovhpd (%r8,%rcx), %xmm11, %xmm12 + vmovhpd (%r8,%rdi), %xmm14, %xmm15 + +/* lM==EXP(2^N) */ + vpand _lExpMask+__svml_dexp10_data_internal(%rip), %ymm0, %ymm6 + vinsertf128 $1, %xmm15, %ymm12, %ymm5 + +/* Tj*poly */ + vfmadd213pd %ymm5, %ymm5, %ymm1 + +/* quick 2^N */ + vpaddq %ymm6, %ymm1, %ymm0 + +/* Finish */ + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm2, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper 
+ movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + 
.cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call exp10@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_exp10_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dexp10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _dbT[(1<<7)][2]; + __declspec(align(32)) VUINT32 _dbLg2_10[4][2]; + __declspec(align(32)) VUINT32 _dbShifter[4][2]; + __declspec(align(32)) VUINT32 _dbInvLg2_10hi[4][2]; + __declspec(align(32)) VUINT32 _dbInvLg2_10lo[4][2]; + __declspec(align(32)) VUINT32 _dPC1[4][2]; + __declspec(align(32)) VUINT32 _dPC2[4][2]; + __declspec(align(32)) VUINT32 _dPC3[4][2]; + __declspec(align(32)) VUINT32 _dPC4[4][2]; + __declspec(align(32)) VUINT32 _dPC5[4][2]; + __declspec(align(32)) VUINT32 _lExpMask[4][2]; + __declspec(align(32)) VUINT32 _iIndexMask[8][1]; + __declspec(align(32)) VUINT32 _iAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; +} __svml_dexp10_data_internal; +#endif +__svml_dexp10_data_internal: + /*== _dbT ==*/ + .quad 0x3ff0000000000000 /*2^( 0 /128)*/ + .quad 0x3ff0163da9fb3335 /*2^( 1 /128)*/ + .quad 0x3ff02c9a3e778061 /*2^( 2 /128)*/ + .quad 0x3ff04315e86e7f85 /*2^( 3 /128)*/ + .quad 0x3ff059b0d3158574 /*2^( 4 /128)*/ + .quad 0x3ff0706b29ddf6de /*2^( 5 /128)*/ + .quad 0x3ff0874518759bc8 /*2^( 6 /128)*/ + .quad 0x3ff09e3ecac6f383 /*2^( 7 /128)*/ + .quad 0x3ff0b5586cf9890f /*2^( 8 /128)*/ + .quad 0x3ff0cc922b7247f7 /*2^( 9 /128)*/ + .quad 0x3ff0e3ec32d3d1a2 /*2^( 10 /128)*/ + .quad 0x3ff0fb66affed31b /*2^( 11 /128)*/ + .quad 0x3ff11301d0125b51 /*2^( 12 /128)*/ + .quad 0x3ff12abdc06c31cc /*2^( 13
/128)*/ + .quad 0x3ff1429aaea92de0 /*2^( 14 /128)*/ + .quad 0x3ff15a98c8a58e51 /*2^( 15 /128)*/ + .quad 0x3ff172b83c7d517b /*2^( 16 /128)*/ + .quad 0x3ff18af9388c8dea /*2^( 17 /128)*/ + .quad 0x3ff1a35beb6fcb75 /*2^( 18 /128)*/ + .quad 0x3ff1bbe084045cd4 /*2^( 19 /128)*/ + .quad 0x3ff1d4873168b9aa /*2^( 20 /128)*/ + .quad 0x3ff1ed5022fcd91d /*2^( 21 /128)*/ + .quad 0x3ff2063b88628cd6 /*2^( 22 /128)*/ + .quad 0x3ff21f49917ddc96 /*2^( 23 /128)*/ + .quad 0x3ff2387a6e756238 /*2^( 24 /128)*/ + .quad 0x3ff251ce4fb2a63f /*2^( 25 /128)*/ + .quad 0x3ff26b4565e27cdd /*2^( 26 /128)*/ + .quad 0x3ff284dfe1f56381 /*2^( 27 /128)*/ + .quad 0x3ff29e9df51fdee1 /*2^( 28 /128)*/ + .quad 0x3ff2b87fd0dad990 /*2^( 29 /128)*/ + .quad 0x3ff2d285a6e4030b /*2^( 30 /128)*/ + .quad 0x3ff2ecafa93e2f56 /*2^( 31 /128)*/ + .quad 0x3ff306fe0a31b715 /*2^( 32 /128)*/ + .quad 0x3ff32170fc4cd831 /*2^( 33 /128)*/ + .quad 0x3ff33c08b26416ff /*2^( 34 /128)*/ + .quad 0x3ff356c55f929ff1 /*2^( 35 /128)*/ + .quad 0x3ff371a7373aa9cb /*2^( 36 /128)*/ + .quad 0x3ff38cae6d05d866 /*2^( 37 /128)*/ + .quad 0x3ff3a7db34e59ff7 /*2^( 38 /128)*/ + .quad 0x3ff3c32dc313a8e5 /*2^( 39 /128)*/ + .quad 0x3ff3dea64c123422 /*2^( 40 /128)*/ + .quad 0x3ff3fa4504ac801c /*2^( 41 /128)*/ + .quad 0x3ff4160a21f72e2a /*2^( 42 /128)*/ + .quad 0x3ff431f5d950a897 /*2^( 43 /128)*/ + .quad 0x3ff44e086061892d /*2^( 44 /128)*/ + .quad 0x3ff46a41ed1d0057 /*2^( 45 /128)*/ + .quad 0x3ff486a2b5c13cd0 /*2^( 46 /128)*/ + .quad 0x3ff4a32af0d7d3de /*2^( 47 /128)*/ + .quad 0x3ff4bfdad5362a27 /*2^( 48 /128)*/ + .quad 0x3ff4dcb299fddd0d /*2^( 49 /128)*/ + .quad 0x3ff4f9b2769d2ca7 /*2^( 50 /128)*/ + .quad 0x3ff516daa2cf6642 /*2^( 51 /128)*/ + .quad 0x3ff5342b569d4f82 /*2^( 52 /128)*/ + .quad 0x3ff551a4ca5d920f /*2^( 53 /128)*/ + .quad 0x3ff56f4736b527da /*2^( 54 /128)*/ + .quad 0x3ff58d12d497c7fd /*2^( 55 /128)*/ + .quad 0x3ff5ab07dd485429 /*2^( 56 /128)*/ + .quad 0x3ff5c9268a5946b7 /*2^( 57 /128)*/ + .quad 0x3ff5e76f15ad2148 /*2^( 58 /128)*/ + .quad 
0x3ff605e1b976dc09 /*2^( 59 /128)*/ + .quad 0x3ff6247eb03a5585 /*2^( 60 /128)*/ + .quad 0x3ff6434634ccc320 /*2^( 61 /128)*/ + .quad 0x3ff6623882552225 /*2^( 62 /128)*/ + .quad 0x3ff68155d44ca973 /*2^( 63 /128)*/ + .quad 0x3ff6a09e667f3bcd /*2^( 64 /128)*/ + .quad 0x3ff6c012750bdabf /*2^( 65 /128)*/ + .quad 0x3ff6dfb23c651a2f /*2^( 66 /128)*/ + .quad 0x3ff6ff7df9519484 /*2^( 67 /128)*/ + .quad 0x3ff71f75e8ec5f74 /*2^( 68 /128)*/ + .quad 0x3ff73f9a48a58174 /*2^( 69 /128)*/ + .quad 0x3ff75feb564267c9 /*2^( 70 /128)*/ + .quad 0x3ff780694fde5d3f /*2^( 71 /128)*/ + .quad 0x3ff7a11473eb0187 /*2^( 72 /128)*/ + .quad 0x3ff7c1ed0130c132 /*2^( 73 /128)*/ + .quad 0x3ff7e2f336cf4e62 /*2^( 74 /128)*/ + .quad 0x3ff80427543e1a12 /*2^( 75 /128)*/ + .quad 0x3ff82589994cce13 /*2^( 76 /128)*/ + .quad 0x3ff8471a4623c7ad /*2^( 77 /128)*/ + .quad 0x3ff868d99b4492ed /*2^( 78 /128)*/ + .quad 0x3ff88ac7d98a6699 /*2^( 79 /128)*/ + .quad 0x3ff8ace5422aa0db /*2^( 80 /128)*/ + .quad 0x3ff8cf3216b5448c /*2^( 81 /128)*/ + .quad 0x3ff8f1ae99157736 /*2^( 82 /128)*/ + .quad 0x3ff9145b0b91ffc6 /*2^( 83 /128)*/ + .quad 0x3ff93737b0cdc5e5 /*2^( 84 /128)*/ + .quad 0x3ff95a44cbc8520f /*2^( 85 /128)*/ + .quad 0x3ff97d829fde4e50 /*2^( 86 /128)*/ + .quad 0x3ff9a0f170ca07ba /*2^( 87 /128)*/ + .quad 0x3ff9c49182a3f090 /*2^( 88 /128)*/ + .quad 0x3ff9e86319e32323 /*2^( 89 /128)*/ + .quad 0x3ffa0c667b5de565 /*2^( 90 /128)*/ + .quad 0x3ffa309bec4a2d33 /*2^( 91 /128)*/ + .quad 0x3ffa5503b23e255d /*2^( 92 /128)*/ + .quad 0x3ffa799e1330b358 /*2^( 93 /128)*/ + .quad 0x3ffa9e6b5579fdbf /*2^( 94 /128)*/ + .quad 0x3ffac36bbfd3f37a /*2^( 95 /128)*/ + .quad 0x3ffae89f995ad3ad /*2^( 96 /128)*/ + .quad 0x3ffb0e07298db666 /*2^( 97 /128)*/ + .quad 0x3ffb33a2b84f15fb /*2^( 98 /128)*/ + .quad 0x3ffb59728de5593a /*2^( 99 /128)*/ + .quad 0x3ffb7f76f2fb5e47 /*2^( 100 /128)*/ + .quad 0x3ffba5b030a1064a /*2^( 101 /128)*/ + .quad 0x3ffbcc1e904bc1d2 /*2^( 102 /128)*/ + .quad 0x3ffbf2c25bd71e09 /*2^( 103 /128)*/ + .quad 
0x3ffc199bdd85529c /*2^( 104 /128)*/ + .quad 0x3ffc40ab5fffd07a /*2^( 105 /128)*/ + .quad 0x3ffc67f12e57d14b /*2^( 106 /128)*/ + .quad 0x3ffc8f6d9406e7b5 /*2^( 107 /128)*/ + .quad 0x3ffcb720dcef9069 /*2^( 108 /128)*/ + .quad 0x3ffcdf0b555dc3fa /*2^( 109 /128)*/ + .quad 0x3ffd072d4a07897c /*2^( 110 /128)*/ + .quad 0x3ffd2f87080d89f2 /*2^( 111 /128)*/ + .quad 0x3ffd5818dcfba487 /*2^( 112 /128)*/ + .quad 0x3ffd80e316c98398 /*2^( 113 /128)*/ + .quad 0x3ffda9e603db3285 /*2^( 114 /128)*/ + .quad 0x3ffdd321f301b460 /*2^( 115 /128)*/ + .quad 0x3ffdfc97337b9b5f /*2^( 116 /128)*/ + .quad 0x3ffe264614f5a129 /*2^( 117 /128)*/ + .quad 0x3ffe502ee78b3ff6 /*2^( 118 /128)*/ + .quad 0x3ffe7a51fbc74c83 /*2^( 119 /128)*/ + .quad 0x3ffea4afa2a490da /*2^( 120 /128)*/ + .quad 0x3ffecf482d8e67f1 /*2^( 121 /128)*/ + .quad 0x3ffefa1bee615a27 /*2^( 122 /128)*/ + .quad 0x3fff252b376bba97 /*2^( 123 /128)*/ + .quad 0x3fff50765b6e4540 /*2^( 124 /128)*/ + .quad 0x3fff7bfdad9cbe14 /*2^( 125 /128)*/ + .quad 0x3fffa7c1819e90d8 /*2^( 126 /128)*/ + .quad 0x3fffd3c22b8f71f1 /*2^( 127 /128)*/ + .align 32 + .quad 0x407a934f0979a371, 0x407a934f0979a371, 0x407a934f0979a371, 0x407a934f0979a371 /* _dbLg2_10*2^K */ + .align 32 + .quad 0x4338800000000000, 0x4338800000000000, 0x4338800000000000, 0x4338800000000000 /* _dbShifter */ + .align 32 + .quad 0x3f63441350a00000, 0x3f63441350a00000, 0x3f63441350a00000, 0x3f63441350a00000 /* _dbInvLg2_10hi/2^K 53-11-K bits*/ + .align 32 + .quad 0xbd10c0219dc1da99, 0xbd10c0219dc1da99, 0xbd10c0219dc1da99, 0xbd10c0219dc1da99 /* _dbInvLg2_10lo/2^K */ + //PC0 = 1.0 + .align 32 + .quad 0x40026bb1bbb55516, 0x40026bb1bbb55516, 0x40026bb1bbb55516, 0x40026bb1bbb55516 /* _dPC1 */ + .align 32 + .quad 0x40053524c73ce8e3, 0x40053524c73ce8e3, 0x40053524c73ce8e3, 0x40053524c73ce8e3 /* _dPC2 */ + .align 32 + .quad 0x4000470591ccea8b, 0x4000470591ccea8b, 0x4000470591ccea8b, 0x4000470591ccea8b /* _dPC3 */ + .align 32 + .quad 0x3ff2bd767584db59, 0x3ff2bd767584db59, 0x3ff2bd767584db59, 
0x3ff2bd767584db59 /* _dPC4 */ + .align 32 + .quad 0x3fe144c03efafb54, 0x3fe144c03efafb54, 0x3fe144c03efafb54, 0x3fe144c03efafb54 /* _dPC5 */ + .align 32 + .quad 0xfff0000000000000, 0xfff0000000000000, 0xfff0000000000000, 0xfff0000000000000 /* _lExpMask */ + .align 32 + .long 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f /* _iIndexMask =(2^K-1)*/ + //common + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 32 + .long 0x40733a70, 0x40733a70, 0x40733a70, 0x40733a70, 0x40733a70, 0x40733a70, 0x40733a70, 0x40733a70 /* _iDomainRange */ + .align 32 + .type __svml_dexp10_data_internal,@object + .size __svml_dexp10_data_internal,.-__svml_dexp10_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core-avx2.S new file mode 100644 index 0000000000..3aff9446d3 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized exp10, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define _ZGVeN8v_exp10 _ZGVeN8v_exp10_avx2_wrapper +#include "../svml_d_exp108_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core.c new file mode 100644 index 0000000000..d592663169 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized exp10, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN8v_exp10 +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_exp10, __GI__ZGVeN8v_exp10, __redirect__ZGVeN8v_exp10) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core_avx512.S new file mode 100644 index 0000000000..953cb5bc1a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_exp108_core_avx512.S @@ -0,0 +1,287 @@ +/* Function exp10 vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ +   The GNU C Library is free software; you can redistribute it and/or +   modify it under the terms of the GNU Lesser General Public +   License as published by the Free Software Foundation; either +   version 2.1 of the License, or (at your option) any later version. + +   The GNU C Library is distributed in the hope that it will be useful, +   but WITHOUT ANY WARRANTY; without even the implied warranty of +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU +   Lesser General Public License for more details. + +   You should have received a copy of the GNU Lesser General Public +   License along with the GNU C Library; if not, see +   https://www.gnu.org/licenses/.  */ + +/* + * ALGORITHM DESCRIPTION: + *   Typical exp10() implementation, except that: + *    - tables are small (16 elements), allowing for fast gathers + *    - all arguments processed in the main path + *        - final VSCALEF assists branch-free design (correct overflow/underflow and special case responses) + *        - a VAND is used to ensure the reduced argument |R|<2, even for large inputs + *        - RZ mode used to avoid overflow to +/-Inf for x*log2(10); helps with special case handling + *        - SAE used to avoid spurious flag settings + * + */ + +/* Offsets for data table __svml_dexp10_data_internal_avx512 + */ +#define Exp_tbl_H 0 +#define L2E 128 +#define Shifter 192 +#define L2H 256 +#define L2L 320 +#define EMask 384 +#define poly_coeff6 448 +#define poly_coeff5 512 +#define poly_coeff4 576 +#define poly_coeff3 640 +#define poly_coeff2 704 +#define poly_coeff1 768 +#define AbsMask 832 +#define Threshold 896 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_exp10_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups L2E+__svml_dexp10_data_internal_avx512(%rip), %zmm4 + vmovups Shifter+__svml_dexp10_data_internal_avx512(%rip), %zmm2 + vmovups L2H+__svml_dexp10_data_internal_avx512(%rip), %zmm5 + 
vmovups L2L+__svml_dexp10_data_internal_avx512(%rip), %zmm3 + +/* polynomial */ + vmovups poly_coeff6+__svml_dexp10_data_internal_avx512(%rip), %zmm6 + vmovups poly_coeff4+__svml_dexp10_data_internal_avx512(%rip), %zmm7 + vmovups poly_coeff3+__svml_dexp10_data_internal_avx512(%rip), %zmm9 + vmovups poly_coeff2+__svml_dexp10_data_internal_avx512(%rip), %zmm8 + vmovups poly_coeff1+__svml_dexp10_data_internal_avx512(%rip), %zmm11 + vmovups Threshold+__svml_dexp10_data_internal_avx512(%rip), %zmm14 + vmovaps %zmm0, %zmm1 + +/* 2^(52-4)*1.5 + x * log2(10) */ + vfmadd213pd {rz-sae}, %zmm2, %zmm1, %zmm4 + vandpd AbsMask+__svml_dexp10_data_internal_avx512(%rip), %zmm1, %zmm13 + +/* Z0 ~ x*log2(10), rounded down to 4 fractional bits */ + vsubpd {rn-sae}, %zmm2, %zmm4, %zmm0 + +/* Table lookup: Th */ + vmovups __svml_dexp10_data_internal_avx512(%rip), %zmm2 + vcmppd $29, {sae}, %zmm14, %zmm13, %k0 + +/* R = x - Z0*log10(2) */ + vfnmadd213pd {rn-sae}, %zmm1, %zmm0, %zmm5 + vpermt2pd Exp_tbl_H+64+__svml_dexp10_data_internal_avx512(%rip), %zmm4, %zmm2 + kmovw %k0, %edx + vfnmadd231pd {rn-sae}, %zmm0, %zmm3, %zmm5 + vmovups poly_coeff5+__svml_dexp10_data_internal_avx512(%rip), %zmm3 + +/* ensure |R|<2 even for special cases */ + vandpd EMask+__svml_dexp10_data_internal_avx512(%rip), %zmm5, %zmm12 + vmulpd {rn-sae}, %zmm12, %zmm12, %zmm10 + vmulpd {rn-sae}, %zmm12, %zmm2, %zmm15 + vfmadd231pd {rn-sae}, %zmm12, %zmm6, %zmm3 + vfmadd231pd {rn-sae}, %zmm12, %zmm7, %zmm9 + vfmadd231pd {rn-sae}, %zmm12, %zmm8, %zmm11 + vfmadd213pd {rn-sae}, %zmm9, %zmm10, %zmm3 + vfmadd213pd {rn-sae}, %zmm11, %zmm10, %zmm3 + vfmadd213pd {rn-sae}, %zmm2, %zmm15, %zmm3 + vscalefpd {rn-sae}, %zmm0, %zmm3, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + 
cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm1, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; 
DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call exp10@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_exp10_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dexp10_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Exp_tbl_H[16][2]; + __declspec(align(64)) VUINT32 L2E[8][2]; + __declspec(align(64)) VUINT32 Shifter[8][2]; + __declspec(align(64)) VUINT32 L2H[8][2]; + __declspec(align(64)) VUINT32 L2L[8][2]; + __declspec(align(64)) VUINT32 EMask[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + __declspec(align(64)) VUINT32 AbsMask[8][2]; + __declspec(align(64)) VUINT32 Threshold[8][2]; + } __svml_dexp10_data_internal_avx512; +#endif +__svml_dexp10_data_internal_avx512: + /*== Exp_tbl_H ==*/ + .quad 0x3ff0000000000000 + .quad 0x3ff0b5586cf9890f + .quad 0x3ff172b83c7d517b + .quad 0x3ff2387a6e756238 + .quad 0x3ff306fe0a31b715 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff5ab07dd485429 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff7a11473eb0187 + .quad 0x3ff8ace5422aa0db + .quad 0x3ff9c49182a3f090 + 
.quad 0x3ffae89f995ad3ad + .quad 0x3ffc199bdd85529c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffea4afa2a490da + /*== L2E = log2(10) ==*/ + .align 64 + .quad 0x400A934F0979A371, 0x400A934F0979A371, 0x400A934F0979A371, 0x400A934F0979A371, 0x400A934F0979A371, 0x400A934F0979A371, 0x400A934F0979A371, 0x400A934F0979A371 + /*== Shifter=2^(52-4)*1.5 ==*/ + .align 64 + .quad 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0 + /*== L2H = log10(2)_high ==*/ + .align 64 + .quad 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff + /*== L2L = log10(2)_low ==*/ + .align 64 + .quad 0xbc49dc1da994fd21, 0xbc49dc1da994fd21, 0xbc49dc1da994fd21, 0xbc49dc1da994fd21, 0xbc49dc1da994fd21, 0xbc49dc1da994fd21, 0xbc49dc1da994fd21, 0xbc49dc1da994fd21 + /*== EMask ==*/ + .align 64 + .quad 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff + /*== poly_coeff6 ==*/ + .align 64 + .quad 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020, 0x3fcb137ed8ac2020 + /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3fe141a8e24f9424, 0x3fe141a8e24f9424, 0x3fe141a8e24f9424, 0x3fe141a8e24f9424, 0x3fe141a8e24f9424, 0x3fe141a8e24f9424, 0x3fe141a8e24f9424, 0x3fe141a8e24f9424 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d, 0x3ff2bd77a0926c9d + /*== poly_coeff3 ==*/ + .align 64 + .quad 0x40004705908704c8, 0x40004705908704c8, 0x40004705908704c8, 0x40004705908704c8, 0x40004705908704c8, 0x40004705908704c8, 0x40004705908704c8, 0x40004705908704c8 + /*== poly_coeff2 ==*/ + .align 64 + .quad 
0x40053524c73dfe25, 0x40053524c73dfe25, 0x40053524c73dfe25, 0x40053524c73dfe25, 0x40053524c73dfe25, 0x40053524c73dfe25, 0x40053524c73dfe25, 0x40053524c73dfe25 + /*== poly_coeff1 ==*/ + .align 64 + .quad 0x40026bb1bbb554c2, 0x40026bb1bbb554c2, 0x40026bb1bbb554c2, 0x40026bb1bbb554c2, 0x40026bb1bbb554c2, 0x40026bb1bbb554c2, 0x40026bb1bbb554c2, 0x40026bb1bbb554c2 + /*== AbsMask ==*/ + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== Threshold ==*/ + .align 64 + .quad 0x40733A7146F72A41, 0x40733A7146F72A41, 0x40733A7146F72A41, 0x40733A7146F72A41, 0x40733A7146F72A41, 0x40733A7146F72A41, 0x40733A7146F72A41, 0x40733A7146F72A41 + .align 64 + .type __svml_dexp10_data_internal_avx512,@object + .size __svml_dexp10_data_internal_avx512,.-__svml_dexp10_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core-avx2.S new file mode 100644 index 0000000000..dda41c9c8f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized exp10f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define _ZGVeN16v_exp10f _ZGVeN16v_exp10f_avx2_wrapper +#include "../svml_s_exp10f16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core.c new file mode 100644 index 0000000000..8176a5912b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized exp10f, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_exp10f +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_exp10f, __GI__ZGVeN16v_exp10f, + __redirect__ZGVeN16v_exp10f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core_avx512.S new file mode 100644 index 0000000000..fc9309c90f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f16_core_avx512.S @@ -0,0 +1,269 @@ +/* Function exp10f vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * Typical exp10() implementation, except that: + * - tables are small (16 elements), allowing for fast gathers + * - all arguments processed in the main path + * - final VSCALEF assists branch-free design (correct overflow/underflow and special case responses) + * - a VAND is used to ensure the reduced argument |R|<2, even for large inputs + * - RZ mode used to avoid overflow to +/-Inf for x*log2(10); helps with special case handling + * - SAE used to avoid spurious flag settings + * + */ + +/* Offsets for data table __svml_sexp10_data_internal_avx512 + */ +#define Exp_tbl_L 0 +#define Exp_tbl_H 128 +#define L2E 256 +#define Shifter 320 +#define L2H 384 +#define L2L 448 +#define EMask 512 +#define AbsMask 576 +#define Threshold 640 +#define poly_coeff2 704 +#define poly_coeff1 768 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_exp10f_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups L2E+__svml_sexp10_data_internal_avx512(%rip), %zmm2 + vmovups Shifter+__svml_sexp10_data_internal_avx512(%rip), %zmm1 + vmovups L2H+__svml_sexp10_data_internal_avx512(%rip), %zmm5 + vmovups L2L+__svml_sexp10_data_internal_avx512(%rip), %zmm4 + +/* ensure |R|<2 
even for special cases */ + vmovups EMask+__svml_sexp10_data_internal_avx512(%rip), %zmm6 + vmovups poly_coeff2+__svml_sexp10_data_internal_avx512(%rip), %zmm9 + +/* 2^(23-10)*1.5 + x * log2(10) */ + vfmadd213ps {rz-sae}, %zmm1, %zmm0, %zmm2 + vmovups poly_coeff1+__svml_sexp10_data_internal_avx512(%rip), %zmm10 + vmovups __svml_sexp10_data_internal_avx512(%rip), %zmm8 + vmovups Exp_tbl_H+__svml_sexp10_data_internal_avx512(%rip), %zmm15 + vmovups Threshold+__svml_sexp10_data_internal_avx512(%rip), %zmm13 + vpsrld $5, %zmm2, %zmm3 + +/* Z0 ~ x*log2(10), rounded down to 6 fractional bits */ + vsubps {rn-sae}, %zmm1, %zmm2, %zmm1 + vpermt2ps Exp_tbl_L+64+__svml_sexp10_data_internal_avx512(%rip), %zmm2, %zmm8 + vpermt2ps Exp_tbl_H+64+__svml_sexp10_data_internal_avx512(%rip), %zmm3, %zmm15 + vandps AbsMask+__svml_sexp10_data_internal_avx512(%rip), %zmm0, %zmm12 + +/* R = x - Z0*log(2) */ + vfnmadd213ps {rn-sae}, %zmm0, %zmm1, %zmm5 + vcmpps $29, {sae}, %zmm13, %zmm12, %k0 + vfnmadd231ps {rn-sae}, %zmm1, %zmm4, %zmm5 + kmovw %k0, %edx + vrangeps $2, {sae}, %zmm6, %zmm5, %zmm11 + vfmadd231ps {rn-sae}, %zmm11, %zmm9, %zmm10 + vmulps {rn-sae}, %zmm11, %zmm10, %zmm14 + +/* x!=0? 
*/ + vpxord %zmm7, %zmm7, %zmm7 + vcmpps $4, {sae}, %zmm7, %zmm0, %k1 + +/* Th*Tl */ + vmulps {rn-sae}, %zmm8, %zmm15, %zmm15{%k1} + vfmadd213ps {rn-sae}, %zmm15, %zmm14, %zmm15 + vscalefps {rn-sae}, %zmm1, %zmm15, %zmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm1, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm1, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 
r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call exp10f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_exp10f_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_sexp10_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Exp_tbl_L[32][1]; + __declspec(align(64)) VUINT32 Exp_tbl_H[32][1]; + __declspec(align(64)) VUINT32 L2E[16][1]; + __declspec(align(64)) VUINT32 Shifter[16][1]; + __declspec(align(64)) VUINT32 L2H[16][1]; + __declspec(align(64)) VUINT32 L2L[16][1]; + __declspec(align(64)) VUINT32 EMask[16][1]; + __declspec(align(64)) VUINT32 AbsMask[16][1]; + __declspec(align(64)) VUINT32 Threshold[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 poly_coeff1[16][1]; + } 
__svml_sexp10_data_internal_avx512; +#endif +__svml_sexp10_data_internal_avx512: + /*== Exp_tbl_L ==*/ + .long 0x3f800001, 0x3f801631, 0x3f802c65, 0x3f80429d + .long 0x3f8058d9, 0x3f806f18, 0x3f80855c, 0x3f809ba3 + .long 0x3f80b1ee, 0x3f80c83d, 0x3f80de90, 0x3f80f4e7 + .long 0x3f810b42, 0x3f8121a0, 0x3f813803, 0x3f814e69 + .long 0x3f8164d3, 0x3f817b41, 0x3f8191b3, 0x3f81a829 + .long 0x3f81bea2, 0x3f81d520, 0x3f81eba2, 0x3f820227 + .long 0x3f8218b0, 0x3f822f3d, 0x3f8245cf, 0x3f825c64 + .long 0x3f8272fd, 0x3f828999, 0x3f82a03a, 0x3f82b6df + /*== Exp_tbl_H ==*/ + .align 64 + .long 0x3f800000, 0x3f82cd87, 0x3f85aac3, 0x3f88980f + .long 0x3f8b95c2, 0x3f8ea43a, 0x3f91c3d3, 0x3f94f4f0 + .long 0x3f9837f0, 0x3f9b8d3a, 0x3f9ef532, 0x3fa27043 + .long 0x3fa5fed7, 0x3fa9a15b, 0x3fad583f, 0x3fb123f6 + .long 0x3fb504f3, 0x3fb8fbaf, 0x3fbd08a4, 0x3fc12c4d + .long 0x3fc5672a, 0x3fc9b9be, 0x3fce248c, 0x3fd2a81e + .long 0x3fd744fd, 0x3fdbfbb8, 0x3fe0ccdf, 0x3fe5b907 + .long 0x3feac0c7, 0x3fefe4ba, 0x3ff5257d, 0x3ffa83b3 + /*== log2(10) ==*/ + .align 64 + .long 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78, 0x40549A78 + /*== Shifter=2^(23-10)*1.5 ==*/ + .align 64 + .long 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000, 0x46400000 + /*== L2H = log(2)_high ==*/ + .align 64 + .long 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b + /*== L2L = log(2)_low ==*/ + .align 64 + .long 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860, 0xb2760860 + /*== EMask ==*/ + 
.align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== AbsMask ==*/ + .align 64 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== Threshold ==*/ + .align 64 + .long 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818, 0x4217B818 + /*== poly_coeff2 ==*/ + .align 64 + .long 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA, 0x4029B7DA + /*== poly_coeff1 ==*/ + .align 64 + .long 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D, 0x40135D8D + .align 64 + .type __svml_sexp10_data_internal_avx512,@object + .size __svml_sexp10_data_internal_avx512,.-__svml_sexp10_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core-sse2.S new file mode 100644 index 0000000000..460d01357d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized exp10f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_exp10f _ZGVbN4v_exp10f_sse2 +#include "../svml_s_exp10f4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core.c new file mode 100644 index 0000000000..7ce90a9bae --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized exp10f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVbN4v_exp10f +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_exp10f, __GI__ZGVbN4v_exp10f, + __redirect__ZGVbN4v_exp10f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core_sse4.S new file mode 100644 index 0000000000..879592b789 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f4_core_sse4.S @@ -0,0 +1,311 @@ +/* Function exp10f vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * exp10(x) = 2^(x/log10(2)) = 2^n * (1 + T[j]) * (1 + P(y)) + * where + * x = m*log10(2)/K + y, y in [-log10(2)/K..log10(2)/K] + * m = n*K + j, m,n,j - signed integer, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp10(x)-1 + * on small interval [-log10(2)/K..log10(2)/K] + * + * Special cases: + * + * exp10(NaN) = NaN + * exp10(+INF) = +INF + * exp10(-INF) = 0 + * exp10(x) = 1 for subnormals + * For IEEE float + * if x > 38.5318412780761720 then exp10f(x) overflow + * if x < -45.4555282592773440 then exp10f(x) underflow + * + */ + +/* Offsets for data table __svml_sexp10_data_internal + */ +#define _sT 0 +#define _sLg2_10 128 +#define _sShifter 144 +#define _sInvLg2_10hi 160 +#define _sInvLg2_10lo 176 +#define _sPC0 192 +#define _sPC1 208 +#define _sPC2 224 +#define _iIndexMask 240 +#define _iAbsMask 256 +#define _iDomainRange 272 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_exp10f_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm4 + +/* Load argument */ + movups _sLg2_10+__svml_sexp10_data_internal(%rip), %xmm2 + lea __svml_sexp10_data_internal(%rip), %r8 + mulps %xmm4, %xmm2 + movups _sShifter+__svml_sexp10_data_internal(%rip), %xmm5 + +/* R */ + movups _sInvLg2_10hi+__svml_sexp10_data_internal(%rip), %xmm14 + addps %xmm5, %xmm2 + movaps %xmm2, %xmm1 + movups _sInvLg2_10lo+__svml_sexp10_data_internal(%rip), %xmm15 + subps %xmm5, %xmm1 + mulps %xmm1, %xmm14 + movaps %xmm4, %xmm5 + mulps %xmm1, %xmm15 + subps %xmm14, %xmm5 + +/* + * Polynomial + * exp10 = 2^N*(Tj+Tj*poly) + * poly(sN) = {1+later} a0+a1*sR + */ + movups _sPC2+__svml_sexp10_data_internal(%rip), %xmm1 + subps %xmm15, %xmm5 + mulps %xmm5, %xmm1 + movdqu _iIndexMask+__svml_sexp10_data_internal(%rip), %xmm3 + +/* Index and lookup */ + movdqa %xmm3, %xmm10 + +/* remove index bits */ + pandn %xmm2, %xmm3 + pand %xmm2, %xmm10 + +/* 2^N */ + pslld $18, %xmm3 
+ +/* iIndex *= sizeof(S); */ + pslld $2, %xmm10 + addps _sPC1+__svml_sexp10_data_internal(%rip), %xmm1 + movd %xmm10, %edx + pshufd $1, %xmm10, %xmm7 + pshufd $2, %xmm10, %xmm9 + pshufd $3, %xmm10, %xmm11 + movd %xmm7, %ecx + movd %xmm9, %esi + movd %xmm11, %edi + +/* Check for overflow/underflow */ + movdqu _iAbsMask+__svml_sexp10_data_internal(%rip), %xmm6 + pand %xmm4, %xmm6 + mulps %xmm1, %xmm5 + movslq %edx, %rdx + addps _sPC0+__svml_sexp10_data_internal(%rip), %xmm5 + movslq %ecx, %rcx + movslq %esi, %rsi + movslq %edi, %rdi + movd (%r8,%rdx), %xmm0 + movd (%r8,%rcx), %xmm8 + movd (%r8,%rsi), %xmm13 + movd (%r8,%rdi), %xmm12 + punpckldq %xmm8, %xmm0 + punpckldq %xmm12, %xmm13 + punpcklqdq %xmm13, %xmm0 + +/* Tj_l+Tj_h*poly */ + mulps %xmm0, %xmm5 + pcmpgtd _iDomainRange+__svml_sexp10_data_internal(%rip), %xmm6 + addps %xmm5, %xmm0 + movmskps %xmm6, %eax + +/* quick mul 2^N */ + paddd %xmm3, %xmm0 + +/* Finish */ + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax xmm0 xmm4 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm4, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax + + xorl %edx, %edx + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %edx, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %eax, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + 
movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call exp10f@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_exp10f_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_sexp10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _sT[(1<<5)][1]; + __declspec(align(16)) VUINT32 _sLg2_10[4][1]; + __declspec(align(16)) VUINT32 _sShifter[4][1]; + __declspec(align(16)) VUINT32 _sInvLg2_10hi[4][1]; + __declspec(align(16)) VUINT32 _sInvLg2_10lo[4][1]; + __declspec(align(16)) VUINT32 _sPC0[4][1]; + __declspec(align(16)) VUINT32 _sPC1[4][1]; + __declspec(align(16)) VUINT32 _sPC2[4][1]; + __declspec(align(16)) VUINT32 _iIndexMask[4][1]; + __declspec(align(16)) VUINT32 _iAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; +} __svml_sexp10_data_internal; +#endif +__svml_sexp10_data_internal: + /*== _sT ==*/ + .long 0x3f800000 // 2^( 0 /32 ) + .long 0x3f82cd87 // 2^( 1 /32 ) + .long 0x3f85aac3 // 2^( 2 /32 ) + .long 0x3f88980f // 2^( 3 /32 ) + .long 0x3f8b95c2 // 2^( 4 /32 ) + .long 0x3f8ea43a // 2^( 5 /32 ) + .long 0x3f91c3d3 // 2^( 6 /32 ) + .long 0x3f94f4f0 // 2^( 7 /32 ) + .long 0x3f9837f0 // 2^( 8 /32 ) + .long 0x3f9b8d3a // 2^( 9 /32 ) + .long 0x3f9ef532 // 2^( 10/32 ) + .long 0x3fa27043 // 2^( 11/32 ) + .long 0x3fa5fed7 // 2^( 12/32 ) + .long 0x3fa9a15b // 2^( 13/32 ) + .long 0x3fad583f // 2^( 14/32 ) + .long 0x3fb123f6 // 2^( 15/32 ) + .long 0x3fb504f3 // 2^( 16/32 ) + .long 0x3fb8fbaf // 2^( 17/32 ) + .long 0x3fbd08a4 // 2^( 18/32 ) 
+ .long 0x3fc12c4d // 2^( 19/32 ) + .long 0x3fc5672a // 2^( 20/32 ) + .long 0x3fc9b9be // 2^( 21/32 ) + .long 0x3fce248c // 2^( 22/32 ) + .long 0x3fd2a81e // 2^( 23/32 ) + .long 0x3fd744fd // 2^( 24/32 ) + .long 0x3fdbfbb8 // 2^( 25/32 ) + .long 0x3fe0ccdf // 2^( 26/32 ) + .long 0x3fe5b907 // 2^( 27/32 ) + .long 0x3feac0c7 // 2^( 28/32 ) + .long 0x3fefe4ba // 2^( 29/32 ) + .long 0x3ff5257d // 2^( 30/32 ) + .long 0x3ffa83b3 // 2^( 31/32 ) + .align 16 + .long 0x42d49a78, 0x42d49a78, 0x42d49a78, 0x42d49a78 /* _sLg2_10*2^K */ + .align 16 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter) */ + .align 16 + .long 0x3c1a2000, 0x3c1a2000, 0x3c1a2000, 0x3c1a2000 /* _sInvLg2_10hi/2^K hi (24-K-7) bits*/ + .align 16 + .long 0x341a84fc, 0x341a84fc, 0x341a84fc, 0x341a84fc /* _sInvLg2_10lo/2^K lo bits */ + // otherwise exp10(0) won't produce exact 1.0 + .align 16 + .long 0x2fecc868, 0x2fecc868, 0x2fecc868, 0x2fecc868 /* _sPC0 */ + .align 16 + .long 0x40135e1b, 0x40135e1b, 0x40135e1b, 0x40135e1b /* _sPC1 */ + .align 16 + .long 0x4029a8d2, 0x4029a8d2, 0x4029a8d2, 0x4029a8d2 /* _sPC2 */ + .align 16 + .long 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f /* _iIndexMask =(2^K-1)*/ + //common + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 16 + .long 0x4217b818, 0x4217b818, 0x4217b818, 0x4217b818 /* _iDomainRange=-log10(max_denormal=0x007fffff) RZ */ + .align 16 + .type __svml_sexp10_data_internal,@object + .size __svml_sexp10_data_internal,.-__svml_sexp10_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core-sse.S new file mode 100644 index 0000000000..3f3fe252da --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized exp10f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_exp10f _ZGVdN8v_exp10f_sse_wrapper +#include "../svml_s_exp10f8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core.c new file mode 100644 index 0000000000..1f5ed5a59d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized exp10f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN8v_exp10f +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_exp10f, __GI__ZGVdN8v_exp10f, + __redirect__ZGVdN8v_exp10f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core_avx2.S new file mode 100644 index 0000000000..b576412cf1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_exp10f8_core_avx2.S @@ -0,0 +1,331 @@ +/* Function exp10f vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * exp10(x) = 2^(x/log10(2)) = 2^n * (1 + T[j]) * (1 + P(y)) + * where + * x = m*log10(2)/K + y, y in [-log10(2)/K..log10(2)/K] + * m = n*K + j, m,n,j - signed integer, j in [-K/2..K/2] + * + * values of 2^j/K are tabulated + * + * P(y) is a minimax polynomial approximation of exp10(x)-1 + * on small interval [-log10(2)/K..log10(2)/K] + * + * Special cases: + * + * exp10(NaN) = NaN + * exp10(+INF) = +INF + * exp10(-INF) = 0 + * exp10(x) = 1 for subnormals + * For IEEE float + * if x > 38.5318412780761720 then exp10f(x) overflow + * if x < -45.4555282592773440 then exp10f(x) underflow + * + */ + +/* Offsets for data table __svml_sexp10_data_internal + */ +#define _sT 0 +#define _sLg2_10 128 +#define _sShifter 160 +#define _sInvLg2_10hi 192 +#define _sInvLg2_10lo 224 +#define _sPC0 256 +#define _sPC1 288 +#define _sPC2 320 +#define _iIndexMask 352 +#define _iAbsMask 384 +#define _iDomainRange 416 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_exp10f_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea __svml_sexp10_data_internal(%rip), %rax + vmovups _sShifter+__svml_sexp10_data_internal(%rip), %ymm4 + +/* Load argument */ + vmovups _sLg2_10+__svml_sexp10_data_internal(%rip), %ymm1 + vmovups _iIndexMask+__svml_sexp10_data_internal(%rip), %ymm2 + vmovaps %ymm0, %ymm3 + vfmadd213ps %ymm4, %ymm3, %ymm1 + +/* Index and lookup */ + vandps %ymm2, %ymm1, %ymm7 + +/* iIndex *= sizeof(S); */ + vpslld $2, %ymm7, %ymm10 + vsubps %ymm4, %ymm1, %ymm0 + +/* Check for overflow/underflow */ + vandps _iAbsMask+__svml_sexp10_data_internal(%rip), %ymm3, %ymm5 + vpcmpgtd _iDomainRange+__svml_sexp10_data_internal(%rip), %ymm5, %ymm6 + vmovmskps %ymm6, %edx + vmovd %xmm10, %ecx + vextractf128 $1, %ymm10, %xmm6 + vpextrd $1, %xmm10, %esi + vpextrd $2, %xmm10, %edi + vpextrd $3, %xmm10, %r8d + movslq %ecx, %rcx + movslq %esi, %rsi + 
movslq %edi, %rdi + movslq %r8d, %r8 + vmovd (%rax,%rcx), %xmm8 + vmovd (%rax,%rsi), %xmm9 + vmovd (%rax,%rdi), %xmm11 + vmovd (%rax,%r8), %xmm12 + vpunpckldq %xmm9, %xmm8, %xmm13 + vpunpckldq %xmm12, %xmm11, %xmm14 + vpunpcklqdq %xmm14, %xmm13, %xmm15 + +/* R */ + vmovups _sInvLg2_10hi+__svml_sexp10_data_internal(%rip), %ymm13 + vmovd %xmm6, %r9d + vfnmadd213ps %ymm3, %ymm0, %ymm13 + vpextrd $1, %xmm6, %r10d + movslq %r9d, %r9 + movslq %r10d, %r10 + vfnmadd132ps _sInvLg2_10lo+__svml_sexp10_data_internal(%rip), %ymm13, %ymm0 + vmovd (%rax,%r9), %xmm4 + vmovd (%rax,%r10), %xmm5 + vpunpckldq %xmm5, %xmm4, %xmm9 + +/* + * Polynomial + * exp10 = 2^N*(Tj+Tj*poly) + * poly(sN) = {1+later} a0+a1*sR + */ + vmovups _sPC2+__svml_sexp10_data_internal(%rip), %ymm4 + vfmadd213ps _sPC1+__svml_sexp10_data_internal(%rip), %ymm0, %ymm4 + vpextrd $2, %xmm6, %r11d + vpextrd $3, %xmm6, %ecx + movslq %r11d, %r11 + movslq %ecx, %rcx + vfmadd213ps _sPC0+__svml_sexp10_data_internal(%rip), %ymm0, %ymm4 + vmovd (%rax,%r11), %xmm7 + vmovd (%rax,%rcx), %xmm8 + vpunpckldq %xmm8, %xmm7, %xmm11 + +/* remove index bits */ + vpandn %ymm1, %ymm2, %ymm0 + vpunpcklqdq %xmm11, %xmm9, %xmm12 + +/* 2^N */ + vpslld $18, %ymm0, %ymm1 + vinsertf128 $1, %xmm12, %ymm15, %ymm14 + +/* Tj_l+Tj_h*poly */ + vfmadd213ps %ymm14, %ymm14, %ymm4 + +/* quick mul 2^N */ + vpaddd %ymm1, %ymm4, %ymm0 + +/* Finish */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm3, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) 
(DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 
0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call exp10f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_exp10f_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_sexp10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _sT[(1<<5)][1]; + __declspec(align(32)) VUINT32 _sLg2_10[8][1]; + __declspec(align(32)) VUINT32 _sShifter[8][1]; + __declspec(align(32)) VUINT32 _sInvLg2_10hi[8][1]; + __declspec(align(32)) VUINT32 _sInvLg2_10lo[8][1]; + __declspec(align(32)) VUINT32 _sPC0[8][1]; + __declspec(align(32)) VUINT32 _sPC1[8][1]; + __declspec(align(32)) VUINT32 _sPC2[8][1]; + __declspec(align(32)) VUINT32 _iIndexMask[8][1]; + __declspec(align(32)) VUINT32 _iAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; +} __svml_sexp10_data_internal; +#endif +__svml_sexp10_data_internal: + /*== _sT ==*/ + .long 0x3f800000 // 2^( 0 /32 ) + .long 0x3f82cd87 // 2^( 1 /32 ) + .long 0x3f85aac3 // 2^( 2 /32 ) + .long 0x3f88980f // 2^( 3 /32 ) + .long 0x3f8b95c2 // 2^( 4 /32 ) + .long 0x3f8ea43a // 2^( 5 /32 ) + .long 0x3f91c3d3 // 2^( 6 /32 ) + .long 0x3f94f4f0 // 2^( 7 /32 ) + .long 0x3f9837f0 // 2^( 8 /32 ) + .long 0x3f9b8d3a // 2^( 9 /32 ) + .long 0x3f9ef532 // 2^( 10/32 ) + .long 0x3fa27043 // 2^( 11/32 ) + .long 0x3fa5fed7 // 2^( 12/32 ) + .long 0x3fa9a15b // 2^( 13/32 ) + .long 0x3fad583f // 2^( 14/32 ) + .long 0x3fb123f6 // 2^( 15/32 ) + .long 0x3fb504f3 // 2^( 16/32 ) + .long 0x3fb8fbaf // 2^( 17/32 ) + .long 0x3fbd08a4 // 2^( 18/32 ) + .long 0x3fc12c4d // 2^( 19/32 ) + .long 0x3fc5672a // 2^( 20/32 ) + .long 0x3fc9b9be // 2^( 21/32 ) + .long 0x3fce248c // 2^( 22/32 ) + .long 0x3fd2a81e // 
2^( 23/32 ) + .long 0x3fd744fd // 2^( 24/32 ) + .long 0x3fdbfbb8 // 2^( 25/32 ) + .long 0x3fe0ccdf // 2^( 26/32 ) + .long 0x3fe5b907 // 2^( 27/32 ) + .long 0x3feac0c7 // 2^( 28/32 ) + .long 0x3fefe4ba // 2^( 29/32 ) + .long 0x3ff5257d // 2^( 30/32 ) + .long 0x3ffa83b3 // 2^( 31/32 ) + .align 32 + .long 0x42d49a78, 0x42d49a78, 0x42d49a78, 0x42d49a78, 0x42d49a78, 0x42d49a78, 0x42d49a78, 0x42d49a78 /* _sLg2_10*2^K */ + .align 32 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 32 + .long 0x3c1a2000, 0x3c1a2000, 0x3c1a2000, 0x3c1a2000, 0x3c1a2000, 0x3c1a2000, 0x3c1a2000, 0x3c1a2000 /* _sInvLg2_10hi/2^K hi (24-K-7) bits */ + .align 32 + .long 0x341a84fc, 0x341a84fc, 0x341a84fc, 0x341a84fc, 0x341a84fc, 0x341a84fc, 0x341a84fc, 0x341a84fc /* _sInvLg2_10lo/2^K lo bits */ + // otherwise exp10(0) won't produce exact 1.0 + .align 32 + .long 0x2fecc868, 0x2fecc868, 0x2fecc868, 0x2fecc868, 0x2fecc868, 0x2fecc868, 0x2fecc868, 0x2fecc868 /* _sPC0 */ + .align 32 + .long 0x40135e1b, 0x40135e1b, 0x40135e1b, 0x40135e1b, 0x40135e1b, 0x40135e1b, 0x40135e1b, 0x40135e1b /* _sPC1 */ + .align 32 + .long 0x4029a8d2, 0x4029a8d2, 0x4029a8d2, 0x4029a8d2, 0x4029a8d2, 0x4029a8d2, 0x4029a8d2, 0x4029a8d2 /* _sPC2 */ + .align 32 + .long 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f /* _iIndexMask = (2^K-1) */ + //common + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 32 + .long 0x4217b818, 0x4217b818, 0x4217b818, 0x4217b818, 0x4217b818, 0x4217b818, 0x4217b818, 0x4217b818 /* _iDomainRange=-log10(max_denormal=0x007fffff) RZ */ + .align 32 + .type __svml_sexp10_data_internal,@object + .size __svml_sexp10_data_internal,.-__svml_sexp10_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_exp102_core.S b/sysdeps/x86_64/fpu/svml_d_exp102_core.S new file mode 100644 index 
0000000000..157fb3b7c0 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp102_core.S @@ -0,0 +1,29 @@ +/* Function exp10 vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_exp10) +WRAPPER_IMPL_SSE2 exp10 +END (_ZGVbN2v_exp10) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_exp10) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_exp104_core.S b/sysdeps/x86_64/fpu/svml_d_exp104_core.S new file mode 100644 index 0000000000..9b9d0a5d4b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp104_core.S @@ -0,0 +1,29 @@ +/* Function exp10 vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_exp10) +WRAPPER_IMPL_AVX _ZGVbN2v_exp10 +END (_ZGVdN4v_exp10) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_exp10) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_exp104_core_avx.S b/sysdeps/x86_64/fpu/svml_d_exp104_core_avx.S new file mode 100644 index 0000000000..1ba1a819ed --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp104_core_avx.S @@ -0,0 +1,25 @@ +/* Function exp10 vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_exp10) +WRAPPER_IMPL_AVX _ZGVbN2v_exp10 +END (_ZGVcN4v_exp10) diff --git a/sysdeps/x86_64/fpu/svml_d_exp108_core.S b/sysdeps/x86_64/fpu/svml_d_exp108_core.S new file mode 100644 index 0000000000..a530dc12de --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_exp108_core.S @@ -0,0 +1,25 @@ +/* Function exp10 vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_exp10) +WRAPPER_IMPL_AVX512 _ZGVdN4v_exp10 +END (_ZGVeN8v_exp10) diff --git a/sysdeps/x86_64/fpu/svml_s_exp10f16_core.S b/sysdeps/x86_64/fpu/svml_s_exp10f16_core.S new file mode 100644 index 0000000000..e5043bc875 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp10f16_core.S @@ -0,0 +1,25 @@ +/* Function exp10f vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_exp10f) +WRAPPER_IMPL_AVX512 _ZGVdN8v_exp10f +END (_ZGVeN16v_exp10f) diff --git a/sysdeps/x86_64/fpu/svml_s_exp10f4_core.S b/sysdeps/x86_64/fpu/svml_s_exp10f4_core.S new file mode 100644 index 0000000000..75e6637a82 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp10f4_core.S @@ -0,0 +1,29 @@ +/* Function exp10f vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_exp10f) +WRAPPER_IMPL_SSE2 exp10f +END (_ZGVbN4v_exp10f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_exp10f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_exp10f8_core.S b/sysdeps/x86_64/fpu/svml_s_exp10f8_core.S new file mode 100644 index 0000000000..d481d2dee9 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp10f8_core.S @@ -0,0 +1,29 @@ +/* Function exp10f vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_exp10f) +WRAPPER_IMPL_AVX _ZGVbN4v_exp10f +END (_ZGVdN8v_exp10f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_exp10f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_exp10f8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_exp10f8_core_avx.S new file mode 100644 index 0000000000..65944bd4d2 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_exp10f8_core_avx.S @@ -0,0 +1,25 @@ +/* Function exp10f vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_exp10f) +WRAPPER_IMPL_AVX _ZGVbN4v_exp10f +END (_ZGVcN8v_exp10f) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx.c new file mode 100644 index 0000000000..7cdda9895b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-exp10.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx2.c new file mode 100644 index 0000000000..7cdda9895b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-exp10.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx512f.c new file mode 100644 index 0000000000..7cdda9895b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp10-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-exp10.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-exp10.c b/sysdeps/x86_64/fpu/test-double-libmvec-exp10.c new file mode 100644 index 0000000000..b1461ed85e --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-exp10.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC exp10 +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 2f7172bd7b..256e8f07c9 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVbN2v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVbN2v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVbN2vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVbN2v_exp2) +VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVbN2v_exp10) #define VEC_INT_TYPE __m128i diff --git 
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index e2d519faac..9de1dab2c2 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVdN4v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVdN4v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVdN4vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVdN4v_exp2) +VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVdN4v_exp10) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 1ce4d8b413..43865ab099 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVcN4v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVcN4v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVcN4vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVcN4v_exp2) +VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVcN4v_exp10) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 6c87cec648..5dbdacf617 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atan), _ZGVeN8v_atan) VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVeN8v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVeN8vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVeN8v_exp2) +VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVeN8v_exp10) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx.c new file mode 100644 index 0000000000..be3cdaa80d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx.c 
@@ -0,0 +1 @@ +#include "test-float-libmvec-exp10f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx2.c new file mode 100644 index 0000000000..be3cdaa80d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-exp10f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx512f.c new file mode 100644 index 0000000000..be3cdaa80d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-exp10f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-exp10f.c b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f.c new file mode 100644 index 0000000000..06f447eb8d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-exp10f.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC exp10f +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 597d7d7598..c159c8f583 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVeN16v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVeN16v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVeN16vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVeN16v_exp2f) +VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVeN16v_exp10f) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 3500eec810..c745ef744a 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVbN4v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVbN4v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), 
_ZGVbN4vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVbN4v_exp2f) +VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVbN4v_exp10f) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 921b9c65d6..c9226cf4dc 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVdN8v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVdN8v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVdN8vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVdN8v_exp2f) +VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVdN8v_exp10f) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 6cbcb57521..92970c5ace 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -32,6 +32,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanf), _ZGVcN8v_atanf) VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVcN8v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVcN8vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVcN8v_exp2f) +VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVcN8v_exp10f) #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:18 2021
To: libc-alpha@sourceware.org
Subject: [PATCH v4 06/18] x86-64: Add vector cosh/coshf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:18 -0800
Message-Id: <20211228201130.737370-7-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
From: Sunil Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com

Implement vectorized cosh/coshf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector cosh/coshf with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_cosh2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_cosh2_core.c | 27 ++ .../fpu/multiarch/svml_d_cosh2_core_sse4.S | 396 +++++++++++++++++ .../fpu/multiarch/svml_d_cosh4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_cosh4_core.c | 27 ++ .../fpu/multiarch/svml_d_cosh4_core_avx2.S | 412 ++++++++++++++++++ .../fpu/multiarch/svml_d_cosh8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_cosh8_core.c | 27 ++ .../fpu/multiarch/svml_d_cosh8_core_avx512.S | 323 ++++++++++++++ .../fpu/multiarch/svml_s_coshf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_coshf16_core.c | 28 ++ .../multiarch/svml_s_coshf16_core_avx512.S | 321 ++++++++++++++ .../fpu/multiarch/svml_s_coshf4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_coshf4_core.c | 28 ++ .../fpu/multiarch/svml_s_coshf4_core_sse4.S | 305 +++++++++++++ .../fpu/multiarch/svml_s_coshf8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_coshf8_core.c | 28 ++ .../fpu/multiarch/svml_s_coshf8_core_avx2.S | 308 +++++++++++++ sysdeps/x86_64/fpu/svml_d_cosh2_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_cosh4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_cosh4_core_avx.S | 25 ++ sysdeps/x86_64/fpu/svml_d_cosh8_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_coshf16_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_coshf4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_coshf8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_coshf8_core_avx.S | 25 ++ .../x86_64/fpu/test-double-libmvec-cosh-avx.c | 1 + .../fpu/test-double-libmvec-cosh-avx2.c | 1 + .../fpu/test-double-libmvec-cosh-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-cosh.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-coshf-avx.c | 1 + .../fpu/test-float-libmvec-coshf-avx2.c | 1 + .../fpu/test-float-libmvec-coshf-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-coshf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2637 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_cosh2_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_cosh4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_cosh4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_cosh8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_coshf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_coshf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_coshf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_coshf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cosh.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-coshf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index bc18621f17..35c6ac57a8 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -164,4 +164,15 @@ #define __DECL_SIMD_exp10f32x #define __DECL_SIMD_exp10f64x #define __DECL_SIMD_exp10f128x + +#define __DECL_SIMD_cosh +#define __DECL_SIMD_coshf +#define __DECL_SIMD_coshl +#define __DECL_SIMD_coshf16 +#define __DECL_SIMD_coshf32 +#define __DECL_SIMD_coshf64 +#define __DECL_SIMD_coshf128 +#define __DECL_SIMD_coshf32x +#define __DECL_SIMD_coshf64x +#define __DECL_SIMD_coshf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 870778457f..60a314f69e 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -68,7 +68,7 @@ __MATHCALL (tan,, (_Mdouble_ __x)); /* Hyperbolic functions. */ /* Hyperbolic cosine of X. */ -__MATHCALL (cosh,, (_Mdouble_ __x)); +__MATHCALL_VEC (cosh,, (_Mdouble_ __x)); /* Hyperbolic sine of X. */ __MATHCALL (sinh,, (_Mdouble_ __x)); /* Hyperbolic tangent of X. 
*/ diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index b3c1f59593..4907680143 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,48 +49,56 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F +GLIBC_2.35 _ZGVbN2v_cosh F GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F +GLIBC_2.35 _ZGVbN4v_coshf F GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F +GLIBC_2.35 _ZGVcN4v_cosh F GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F +GLIBC_2.35 _ZGVcN8v_coshf F GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F +GLIBC_2.35 _ZGVdN4v_cosh F GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F +GLIBC_2.35 _ZGVdN8v_coshf F GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F +GLIBC_2.35 _ZGVeN16v_coshf F GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F +GLIBC_2.35 _ZGVeN8v_cosh F GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index f3f9c2e092..708e81b3d0 
100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -82,6 +82,10 @@ # define __DECL_SIMD_exp10 __DECL_SIMD_x86_64 # undef __DECL_SIMD_exp10f # define __DECL_SIMD_exp10f __DECL_SIMD_x86_64 +# undef __DECL_SIMD_cosh +# define __DECL_SIMD_cosh __DECL_SIMD_x86_64 +# undef __DECL_SIMD_coshf +# define __DECL_SIMD_coshf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index c033abbedc..81d0238ebf 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -40,6 +40,8 @@ !GCC$ builtin (exp2f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (exp10) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (exp10f) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (cosh) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (coshf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -65,3 +67,5 @@ !GCC$ builtin (exp2f) attributes simd (notinbranch) if('x32') !GCC$ builtin (exp10) attributes simd (notinbranch) if('x32') !GCC$ builtin (exp10f) attributes simd (notinbranch) if('x32') +!GCC$ builtin (cosh) attributes simd (notinbranch) if('x32') +!GCC$ builtin (coshf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index fd0a9da439..5bc2df134f 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -26,6 +26,7 @@ libmvec-funcs = \ asin \ atan \ cos \ + cosh \ exp \ exp10 \ exp2 \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index f29cfa4cbf..53346d16a2 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,12 +17,14 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_asin; _ZGVcN4v_asin; 
_ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; + _ZGVbN2v_cosh; _ZGVcN4v_cosh; _ZGVdN4v_cosh; _ZGVeN8v_cosh; _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; + _ZGVbN4v_coshf; _ZGVcN8v_coshf; _ZGVdN8v_coshf; _ZGVeN16v_coshf; _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 45f2e4bb53..ac70f15208 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -891,6 +891,26 @@ float: 2 float128: 3 ldouble: 3 +Function: "cosh_vlen16": +float: 2 + +Function: "cosh_vlen2": +double: 2 + +Function: "cosh_vlen4": +double: 2 +float: 2 + +Function: "cosh_vlen4_avx2": +double: 2 + +Function: "cosh_vlen8": +double: 2 +float: 2 + +Function: "cosh_vlen8_avx2": +float: 2 + Function: Real part of "cpow": double: 2 float: 5 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core-sse2.S new file mode 100644 index 0000000000..bfe4e3d0f0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized cosh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_cosh _ZGVbN2v_cosh_sse2 +#include "../svml_d_cosh2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core.c new file mode 100644 index 0000000000..99561fea47 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized cosh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVbN2v_cosh +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_cosh, __GI__ZGVbN2v_cosh, __redirect__ZGVbN2v_cosh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core_sse4.S new file mode 100644 index 0000000000..150bfae7e1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh2_core_sse4.S @@ -0,0 +1,396 @@ +/* Function cosh vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute cosh(x) as (exp(x)+exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * cosh(NaN) = quiet NaN, and raise invalid exception + * cosh(INF) = that INF + * cosh(0) = 1 + * cosh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_dcosh_data_internal + */ +#define _dbT 0 +#define _dbInvLn2 2064 +#define _dbLn2hi 2080 +#define _dbLn2lo 2096 +#define _dbShifter 2112 +#define _iIndexMask 2128 +#define _dPC2 2144 +#define _dPC3 2160 +#define _dPC4 2176 +#define _iMaxIndex 2192 +#define _lExpMask 2208 +#define _dSign 2224 +#define _iDomainRange 2240 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_cosh_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm4 + movups _dSign+__svml_dcosh_data_internal(%rip), %xmm2 + lea _dbT+__svml_dcosh_data_internal(%rip), %r8 + +/* Abs argument */ + movaps %xmm2, %xmm5 + +/* dXSign=0x001000000000 */ + psrlq $11, %xmm2 + +/* + * Load argument + * dM = x*2^K/log(2) + RShifter + */ + movups _dbInvLn2+__svml_dcosh_data_internal(%rip), %xmm3 + andnps %xmm4, %xmm5 + mulpd %xmm5, %xmm3 + movups _dbShifter+__svml_dcosh_data_internal(%rip), %xmm1 + addpd %xmm1, %xmm3 + +/* + * R + * dN = dM - RShifter + */ + movaps %xmm3, %xmm15 + subpd %xmm1, %xmm15 + +/* dR = dX - dN*Log2_hi/2^K */ + movups _dbLn2hi+__svml_dcosh_data_internal(%rip), %xmm14 + mulpd %xmm15, %xmm14 + +/* dR = (dX - dN*Log2_hi/2^K) - dN*Log2_lo/2^K */ + movups _dbLn2lo+__svml_dcosh_data_internal(%rip), %xmm1 + mulpd %xmm15, %xmm1 + +/* + * Check for overflow\underflow + * + */ + pshufd $221, %xmm5, %xmm7 + subpd %xmm14, %xmm5 + movq _iIndexMask+__svml_dcosh_data_internal(%rip), %xmm8 + +/* Index and lookup */ + pshufd $136, %xmm3, %xmm9 + +/* + * G1,G2,G3: dTdif,dTn * 2^N,2^(-N) + * NB: copied from sinh_la - to be optimized!!!!! 
+ */ + psllq $44, %xmm3 + +/* + * trick + * 256=-iIndex + */ + movq _iMaxIndex+__svml_dcosh_data_internal(%rip), %xmm12 + pand %xmm8, %xmm9 + subpd %xmm1, %xmm5 + psubd %xmm9, %xmm12 + +/* iIndex*=3 */ + movdqa %xmm9, %xmm10 + +/* iDomainRange*=3 */ + pslld $3, %xmm12 + pslld $3, %xmm10 + movd %xmm12, %esi + pshufd $1, %xmm12, %xmm13 + movq _iDomainRange+__svml_dcosh_data_internal(%rip), %xmm6 + movd %xmm13, %edi + pcmpgtd %xmm6, %xmm7 + movmskps %xmm7, %eax + +/* dR2 = dR^2 */ + movaps %xmm5, %xmm7 + +/* lM now is an EXP(2^N) */ + pand _lExpMask+__svml_dcosh_data_internal(%rip), %xmm3 + pshufd $1, %xmm10, %xmm11 + movslq %esi, %rsi + mulpd %xmm5, %xmm7 + movd %xmm10, %edx + movsd (%r8,%rsi), %xmm6 + movd %xmm11, %ecx + movslq %edi, %rdi + movslq %edx, %rdx + movslq %ecx, %rcx + movhpd (%r8,%rdi), %xmm6 + +/* */ + psubq %xmm3, %xmm6 + +/* lX- = EXP(1/2) */ + psubq %xmm2, %xmm6 + +/* + * sinh(r) = r +r*r^2*a3 .... + * dSinh_r = r^2*a3 + */ + movups _dPC3+__svml_dcosh_data_internal(%rip), %xmm2 + mulpd %xmm7, %xmm2 + +/* dSinh_r = r + r*r^2*a3 */ + mulpd %xmm5, %xmm2 + movsd (%r8,%rdx), %xmm0 + movhpd (%r8,%rcx), %xmm0 + paddq %xmm3, %xmm0 + addpd %xmm2, %xmm5 + +/* dTn = dTn*2^N - dTn*2^-N */ + movaps %xmm0, %xmm3 + subpd %xmm6, %xmm3 + +/* dTp = dTn*2^N + dTn*2^-N */ + addpd %xmm6, %xmm0 + mulpd %xmm5, %xmm3 + +/* poly(r) = dTp + dTn*sinh(dR)+dTp*dR2*(a2 +a4*dR2) */ + movups _dPC4+__svml_dcosh_data_internal(%rip), %xmm5 + mulpd %xmm7, %xmm5 + addpd _dPC2+__svml_dcosh_data_internal(%rip), %xmm5 + mulpd %xmm5, %xmm7 + +/* dTn*sinh(dR)+dTp*dR2*(a2 +a4*dR2) */ + mulpd %xmm0, %xmm7 + addpd %xmm7, %xmm3 + +/* _VRES1 = dTp + dTn*sinh(dR)+dTp*dR2*(a2 +a4*dR2) */ + addpd %xmm3, %xmm0 + andl $3, %eax + +/* Ret H */ + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax xmm0 xmm4 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) 
+ +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm4, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax xmm0 + + xorl %edx, %edx + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %edx, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %eax, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call cosh@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_cosh_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dcosh_data_internal_typedef typedef unsigned int VUINT32; typedef struct { + __declspec(align(16)) VUINT32 _dbT[(1 + (1<<8))][2]; //dTpj ONLY! + __declspec(align(16)) VUINT32 _dbInvLn2[2][2]; + __declspec(align(16)) VUINT32 _dbLn2hi[2][2]; + __declspec(align(16)) VUINT32 _dbLn2lo[2][2]; + __declspec(align(16)) VUINT32 _dbShifter[2][2]; + __declspec(align(16)) VUINT32 _iIndexMask[4][1]; //(1<
*/ + +#define _ZGVdN4v_cosh _ZGVdN4v_cosh_sse_wrapper +#include "../svml_d_cosh4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core.c new file mode 100644 index 0000000000..c4f59206a9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized cosh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN4v_cosh +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_cosh, __GI__ZGVdN4v_cosh, __redirect__ZGVdN4v_cosh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core_avx2.S new file mode 100644 index 0000000000..2d86a02923 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh4_core_avx2.S @@ -0,0 +1,412 @@ +/* Function cosh vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute cosh(x) as (exp(x)+exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * cosh(NaN) = quiet NaN, and raise invalid exception + * cosh(INF) = that INF + * cosh(0) = 1 + * cosh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_dcosh_data_internal + */ +#define _dbT 0 +#define _dbInvLn2 2080 +#define _dbLn2hi 2112 +#define _dbLn2lo 2144 +#define _dbShifter 2176 +#define _iIndexMask 2208 +#define _dPC2 2240 +#define _dPC3 2272 +#define _dPC4 2304 +#define _iMaxIndex 2336 +#define _lExpMask 2368 +#define _dSign 2400 +#define _iDomainRange 2432 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_cosh_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea _dbT+__svml_dcosh_data_internal(%rip), %rax + vmovupd _dSign+__svml_dcosh_data_internal(%rip), %ymm8 + vmovupd _dbShifter+__svml_dcosh_data_internal(%rip), %ymm6 + +/* + * Load argument + * dM = x*2^K/log(2) + RShifter + */ + vmovupd _dbInvLn2+__svml_dcosh_data_internal(%rip), %ymm3 + +/* + * trick + * 256=-iIndex + */ + vmovups _iMaxIndex+__svml_dcosh_data_internal(%rip), 
%xmm14 + +/* dXSign=0x001000000000 */ + vpsrlq $11, %ymm8, %ymm5 + vmovapd %ymm0, %ymm7 + +/* Abs argument */ + vandnpd %ymm7, %ymm8, %ymm4 + vfmadd213pd %ymm6, %ymm4, %ymm3 + +/* Index and lookup */ + vextractf128 $1, %ymm3, %xmm12 + vshufps $136, %xmm12, %xmm3, %xmm13 + vpand _iIndexMask+__svml_dcosh_data_internal(%rip), %xmm13, %xmm15 + vpsubd %xmm15, %xmm14, %xmm0 + +/* iDomainRange*=3 */ + vpslld $3, %xmm0, %xmm2 + vmovd %xmm2, %r9d + vpextrd $2, %xmm2, %r11d + movslq %r9d, %r9 + vpextrd $1, %xmm2, %r10d + movslq %r11d, %r11 + movslq %r10d, %r10 + vmovsd (%rax,%r9), %xmm12 + +/* + * Check for overflow\underflow + * + */ + vextractf128 $1, %ymm4, %xmm9 + vmovsd (%rax,%r11), %xmm14 + vmovhpd (%rax,%r10), %xmm12, %xmm13 + vshufps $221, %xmm9, %xmm4, %xmm10 + +/* iIndex*=3 */ + vpslld $3, %xmm15, %xmm9 + +/* + * R + * dN = dM - RShifter + */ + vsubpd %ymm6, %ymm3, %ymm15 + vmovd %xmm9, %ecx + vpcmpgtd _iDomainRange+__svml_dcosh_data_internal(%rip), %xmm10, %xmm11 + vmovupd _dbLn2hi+__svml_dcosh_data_internal(%rip), %ymm6 + +/* + * G1,G2,G3: dTdif,dTn * 2^N,2^(-N) + * NB: copied from sinh_la - to be optimized!!!!! 
+ */ + vpsllq $44, %ymm3, %ymm3 + vmovmskps %xmm11, %edx + +/* dR = dX - dN*Log2_hi/2^K */ + vfnmadd231pd %ymm6, %ymm15, %ymm4 + +/* lM now is an EXP(2^N) */ + vpand _lExpMask+__svml_dcosh_data_internal(%rip), %ymm3, %ymm3 + +/* dR = (dX - dN*Log2_hi/2^K) - dN*Log2_lo/2^K */ + vfnmadd231pd _dbLn2lo+__svml_dcosh_data_internal(%rip), %ymm15, %ymm4 + movslq %ecx, %rcx + vpextrd $2, %xmm9, %edi + vpextrd $1, %xmm9, %esi + movslq %edi, %rdi + vmovsd (%rax,%rcx), %xmm1 + vpextrd $3, %xmm9, %r8d + vpextrd $3, %xmm2, %ecx + movslq %esi, %rsi + movslq %r8d, %r8 + movslq %ecx, %rcx + +/* dR2 = dR^2 */ + vmulpd %ymm4, %ymm4, %ymm0 + vmovsd (%rax,%rdi), %xmm10 + vmovhpd (%rax,%rsi), %xmm1, %xmm8 + vmovhpd (%rax,%r8), %xmm10, %xmm11 + vmovhpd (%rax,%rcx), %xmm14, %xmm2 + vinsertf128 $1, %xmm11, %ymm8, %ymm1 + vinsertf128 $1, %xmm2, %ymm13, %ymm2 + vpaddq %ymm3, %ymm1, %ymm6 + +/* */ + vpsubq %ymm3, %ymm2, %ymm1 + +/* + * sinh(r) = r +r*r^2*a3 .... + * dSinh_r = r^2*a3 + */ + vmulpd _dPC3+__svml_dcosh_data_internal(%rip), %ymm0, %ymm2 + +/* lX- = EXP(1/2) */ + vpsubq %ymm5, %ymm1, %ymm5 + +/* dSinh_r = r + r*r^2*a3 */ + vfmadd213pd %ymm4, %ymm4, %ymm2 + +/* poly(r) = dTp + dTn*sinh(dR)+dTp*dR2*(a2 +a4*dR2) */ + vmovupd _dPC4+__svml_dcosh_data_internal(%rip), %ymm4 + +/* dTn = dTn*2^N - dTn*2^-N */ + vsubpd %ymm5, %ymm6, %ymm1 + +/* dTp = dTn*2^N + dTn*2^-N */ + vaddpd %ymm5, %ymm6, %ymm3 + vfmadd213pd _dPC2+__svml_dcosh_data_internal(%rip), %ymm0, %ymm4 + vmulpd %ymm2, %ymm1, %ymm1 + vmulpd %ymm4, %ymm0, %ymm0 + +/* dTn*sinh(dR)+dTp*dR2*(a2 +a4*dR2) */ + vfmadd213pd %ymm1, %ymm3, %ymm0 + +/* _VRES1 = dTp + dTn*sinh(dR)+dTp*dR2*(a2 +a4*dR2) */ + vaddpd %ymm0, %ymm3, %ymm0 + +/* Ret H */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm7 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 
16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm7, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; 
DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call cosh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_cosh_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dcosh_data_internal_typedef typedef unsigned int VUINT32; typedef struct { + __declspec(align(32)) VUINT32 _dbT[(1 + (1<<8))][2]; //dTpj ONLY! + __declspec(align(32)) VUINT32 _dbInvLn2[4][2]; + __declspec(align(32)) VUINT32 _dbLn2hi[4][2]; + __declspec(align(32)) VUINT32 _dbLn2lo[4][2]; + __declspec(align(32)) VUINT32 _dbShifter[4][2]; + __declspec(align(32)) VUINT32 _iIndexMask[8][1]; //(1<. */ + +#define _ZGVeN8v_cosh _ZGVeN8v_cosh_avx2_wrapper +#include "../svml_d_cosh8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core.c new file mode 100644 index 0000000000..576b3186d5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized cosh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN8v_cosh +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_cosh, __GI__ZGVeN8v_cosh, __redirect__ZGVeN8v_cosh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core_avx512.S new file mode 100644 index 0000000000..53040cef9a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cosh8_core_avx512.S @@ -0,0 +1,323 @@ +/* Function cosh vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute cosh(x) as (exp(x)+exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * cosh(NaN) = quiet NaN, and raise invalid exception + * cosh(INF) = that INF + * cosh(0) = 1 + * cosh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_dcosh_data_internal + */ +#define _dTp_h 0 +#define _dTn_h 128 +#define _dbShifter_UISA 256 +#define _dPC2_UISA 320 +#define _dPC3_UISA 384 +#define _dPC4_UISA 448 +#define _dPC5_UISA 512 +#define _dPC6_UISA 576 +#define _dPC7_UISA 640 +#define _dbInvLn2 704 +#define _dbLn2hi 768 +#define _dbLn2lo 832 +#define _dbShifter 896 +#define _dPC2 960 +#define _dPC3 1024 +#define _dPC4 1088 +#define _lExpMask 1152 +#define _dSign 1216 +#define _iDomainRange 1280 + +#include + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_cosh_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups _dSign+__svml_dcosh_data_internal(%rip), %zmm11 + vmovups _dbShifter_UISA+__svml_dcosh_data_internal(%rip), %zmm15 + +/* + * Load argument + * dM = x*2^K/log(2) + RShifter + */ + vmovups _dbInvLn2+__svml_dcosh_data_internal(%rip), %zmm4 + vmovups _dbLn2hi+__svml_dcosh_data_internal(%rip), %zmm2 + vmovups _dbLn2lo+__svml_dcosh_data_internal(%rip), %zmm3 + vmovups _dPC7_UISA+__svml_dcosh_data_internal(%rip), %zmm8 + vmovups _dPC6_UISA+__svml_dcosh_data_internal(%rip), %zmm9 + vmovups _dPC2_UISA+__svml_dcosh_data_internal(%rip), %zmm7 + vmovups _dPC3_UISA+__svml_dcosh_data_internal(%rip), %zmm6 + vmovaps %zmm0, %zmm10 + +/* Abs argument */ + vandnpd %zmm10, %zmm11, %zmm5 + +/* Index and lookup */ + vmovups __svml_dcosh_data_internal(%rip), %zmm11 + vmovups _dTn_h+__svml_dcosh_data_internal(%rip), %zmm0 + vfmadd213pd {rn-sae}, %zmm15, %zmm5, %zmm4 + +/* + * Check for overflow\underflow + * + */ + 
vpsrlq $32, %zmm5, %zmm12 + +/* dN = dM - RShifter */ + vsubpd {rn-sae}, %zmm15, %zmm4, %zmm1 + vpmovqd %zmm12, %ymm13 + vpermt2pd _dTn_h+64+__svml_dcosh_data_internal(%rip), %zmm4, %zmm0 + vpermt2pd _dTp_h+64+__svml_dcosh_data_internal(%rip), %zmm4, %zmm11 + +/* dR = dX - dN*Log2_hi/2^K */ + vfnmadd231pd {rn-sae}, %zmm2, %zmm1, %zmm5 + +/* + * poly(r) = Gmjp(1 + a2*r^2 + a4*r^4) + Gmjn*(r+ a3*r^3 +a5*r^5) = + * = Gmjp_h +Gmjp_l+ Gmjp*r^2*(a2 + a4*r^2) + Gmjn*(r+ r^3*(a3 +a5*r^2) + */ + vmovups _dPC5_UISA+__svml_dcosh_data_internal(%rip), %zmm12 + vpsllq $48, %zmm4, %zmm2 + +/* dR = dX - dN*Log2_hi/2^K */ + vfnmadd231pd {rn-sae}, %zmm3, %zmm1, %zmm5 + vmulpd {rn-sae}, %zmm5, %zmm5, %zmm1 + vfmadd231pd {rn-sae}, %zmm1, %zmm8, %zmm12 + vmovups _dPC4_UISA+__svml_dcosh_data_internal(%rip), %zmm8 + vfmadd213pd {rn-sae}, %zmm6, %zmm1, %zmm12 + vfmadd231pd {rn-sae}, %zmm1, %zmm9, %zmm8 + vfmadd213pd {rn-sae}, %zmm7, %zmm1, %zmm8 + vpcmpgtd _iDomainRange+__svml_dcosh_data_internal(%rip), %ymm13, %ymm14 + vmovmskps %ymm14, %edx + +/* dOut=r^2*(a2 + a4*r^2) */ + vmulpd {rn-sae}, %zmm1, %zmm8, %zmm6 + +/* lM now is an EXP(2^N) */ + vpandq _lExpMask+__svml_dcosh_data_internal(%rip), %zmm2, %zmm3 + vpaddq %zmm3, %zmm11, %zmm4 + vpsubq %zmm3, %zmm0, %zmm0 + vsubpd {rn-sae}, %zmm0, %zmm4, %zmm14 + vaddpd {rn-sae}, %zmm0, %zmm4, %zmm13 + +/* dM=r^2*(a3 +a5*r^2) */ + vmulpd {rn-sae}, %zmm1, %zmm12, %zmm0 + vfmadd213pd {rn-sae}, %zmm13, %zmm13, %zmm6 + +/* dM= r + r^3*(a3 +a5*r^2) */ + vfmadd213pd {rn-sae}, %zmm5, %zmm5, %zmm0 + vfmadd213pd {rn-sae}, %zmm6, %zmm14, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm10 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm10, 
64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 
0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call cosh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_cosh_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dcosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _dTp_h[(1<<4)][2]; + __declspec(align(64)) VUINT32 _dTn_h[(1<<4)][2]; + __declspec(align(64)) VUINT32 _dbShifter_UISA[8][2]; + __declspec(align(64)) VUINT32 _dPC2_UISA[8][2]; + __declspec(align(64)) VUINT32 _dPC3_UISA[8][2]; + __declspec(align(64)) VUINT32 _dPC4_UISA[8][2]; + __declspec(align(64)) VUINT32 _dPC5_UISA[8][2]; + __declspec(align(64)) VUINT32 _dPC6_UISA[8][2]; + __declspec(align(64)) VUINT32 _dPC7_UISA[8][2]; + __declspec(align(64)) VUINT32 _dbInvLn2[8][2]; + __declspec(align(64)) VUINT32 _dbLn2hi[8][2]; + __declspec(align(64)) VUINT32 _dbLn2lo[8][2]; + __declspec(align(64)) VUINT32 _dbShifter[8][2]; + __declspec(align(64)) VUINT32 _dPC2[8][2]; + __declspec(align(64)) VUINT32 _dPC3[8][2]; + __declspec(align(64)) VUINT32 _dPC4[8][2]; + __declspec(align(64)) VUINT32 _lExpMask[8][2]; + __declspec(align(64)) VUINT32 _dSign[8][2]; //0x8000000000000000 + __declspec(align(64)) VUINT32 _iDomainRange[16][1]; +} __svml_dcosh_data_internal; +#endif +__svml_dcosh_data_internal: + /*== _dTp_h ==*/ + .quad 0x3fe0000000000000, 0x3fe0b5586cf9890f, 0x3fe172b83c7d517b, 0x3fe2387a6e756238 + .quad 0x3fe306fe0a31b715, 0x3fe3dea64c123422, 0x3fe4bfdad5362a27, 0x3fe5ab07dd485429 + .quad 
0x3fe6a09e667f3bcd, 0x3fe7a11473eb0187, 0x3fe8ace5422aa0db, 0x3fe9c49182a3f090 + .quad 0x3feae89f995ad3ad, 0x3fec199bdd85529c, 0x3fed5818dcfba487, 0x3feea4afa2a490da + /*== dTn_h ==*/ + .align 64 + .quad 0x3fe0000000000000, 0x3fdea4afa2a490da, 0x3fdd5818dcfba487, 0x3fdc199bdd85529c + .quad 0x3fdae89f995ad3ad, 0x3fd9c49182a3f090, 0x3fd8ace5422aa0db, 0x3fd7a11473eb0187 + .quad 0x3fd6a09e667f3bcd, 0x3fd5ab07dd485429, 0x3fd4bfdad5362a27, 0x3fd3dea64c123422 + .quad 0x3fd306fe0a31b715, 0x3fd2387a6e756238, 0x3fd172b83c7d517b, 0x3fd0b5586cf9890f + .align 64 + .quad 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000, 0x42F8000000000000 /* _dbShifter_UISA */ + .align 64 + .quad 0x3fe0000000000004, 0x3fe0000000000004, 0x3fe0000000000004, 0x3fe0000000000004, 0x3fe0000000000004, 0x3fe0000000000004, 0x3fe0000000000004, 0x3fe0000000000004 /* _dPC2_UISA */ + .align 64 + .quad 0x3fc5555555555543, 0x3fc5555555555543, 0x3fc5555555555543, 0x3fc5555555555543, 0x3fc5555555555543, 0x3fc5555555555543, 0x3fc5555555555543, 0x3fc5555555555543 /* _dPC3_UISA */ + .align 64 + .quad 0x3fa5555555484f37, 0x3fa5555555484f37, 0x3fa5555555484f37, 0x3fa5555555484f37, 0x3fa5555555484f37, 0x3fa5555555484f37, 0x3fa5555555484f37, 0x3fa5555555484f37 /* _dPC4_UISA */ + .align 64 + .quad 0x3f81111111286a0c, 0x3f81111111286a0c, 0x3f81111111286a0c, 0x3f81111111286a0c, 0x3f81111111286a0c, 0x3f81111111286a0c, 0x3f81111111286a0c, 0x3f81111111286a0c /* _dPC5_UISA */ + .align 64 + .quad 0x3f56c183da08f116, 0x3f56c183da08f116, 0x3f56c183da08f116, 0x3f56c183da08f116, 0x3f56c183da08f116, 0x3f56c183da08f116, 0x3f56c183da08f116, 0x3f56c183da08f116 /* _dPC6_UISA */ + .align 64 + .quad 0x3f2a018d76da03da, 0x3f2a018d76da03da, 0x3f2a018d76da03da, 0x3f2a018d76da03da, 0x3f2a018d76da03da, 0x3f2a018d76da03da, 0x3f2a018d76da03da, 0x3f2a018d76da03da /* _dPC7_UISA */ + /*== _dbT ==*/ + .align 64 + .quad 0x3ff71547652b82fe, 0x3ff71547652b82fe, 
0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe /* _dbInvLn2 = 1/log(2) */ + .align 64 + .quad 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000, 0x3FE62E42FEFC0000 /* _dbLn2hi = log(2) hi*/ + .align 64 + .quad 0xBDAC610CA86C3899, 0xBDAC610CA86C3899, 0xBDAC610CA86C3899, 0xBDAC610CA86C3899, 0xBDAC610CA86C3899, 0xBDAC610CA86C3899, 0xBDAC610CA86C3899, 0xBDAC610CA86C3899 /* _dbLn2lo = log(2) lo*/ + .align 64 + .quad 0x42B8000000000000, 0x42B8000000000000, 0x42B8000000000000, 0x42B8000000000000, 0x42B8000000000000, 0x42B8000000000000, 0x42B8000000000000, 0x42B8000000000000 /* _dbShifter */ + .align 64 + .quad 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD /* _dPC2 */ + .align 64 + .quad 0x3FC5555570813E14, 0x3FC5555570813E14, 0x3FC5555570813E14, 0x3FC5555570813E14, 0x3FC5555570813E14, 0x3FC5555570813E14, 0x3FC5555570813E14, 0x3FC5555570813E14 /* _dPC3 */ + .align 64 + .quad 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299 /* _dPC4 */ + .align 64 + .quad 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000 /* _lExpMask */ + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 /* _dSign*/ + .align 64 + .long 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99 /* _iDomainRange 0x40861d9ac12a3e85 =(1021*2^K-0.5)*log(2)/2^K -needed for 
quick exp*/ + .align 64 + .type __svml_dcosh_data_internal,@object + .size __svml_dcosh_data_internal,.-__svml_dcosh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core-avx2.S new file mode 100644 index 0000000000..456d8a129f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized coshf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN16v_coshf _ZGVeN16v_coshf_avx2_wrapper +#include "../svml_s_coshf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core.c new file mode 100644 index 0000000000..34c008871a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized coshf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_coshf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_coshf, __GI__ZGVeN16v_coshf, + __redirect__ZGVeN16v_coshf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core_avx512.S new file mode 100644 index 0000000000..276e3cfe4d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf16_core_avx512.S @@ -0,0 +1,321 @@ +/* Function coshf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute cosh(x) as (exp(x)+exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * cosh(NaN) = quiet NaN, and raise invalid exception + * cosh(INF) = that INF + * cosh(0) = 1 + * cosh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_scosh_data_internal + */ +#define _sExp_tbl_PH 0 +#define _sExp_tbl_NH 128 +#define _sShifter_UISA 256 +#define _iDomainRange_UISA 320 +#define _sPC1_UISA 384 +#define _sPC2_UISA 448 +#define _sPC3_UISA 512 +#define _sInvLn2 576 +#define _sLn2hi 640 +#define _sLn2lo 704 +#define _sSign 768 +#define _iExpMask 832 +#define _sShifter 896 +#define _iDomainRange 960 +#define _sPC1 1024 +#define _sPC2 1088 +#define _sPC3 1152 + +#include + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_coshf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups _sSign+__svml_scosh_data_internal(%rip), %zmm4 + vmovups _sShifter_UISA+__svml_scosh_data_internal(%rip), %zmm6 + +/* + * Load argument + * dM = x/log(2) + RShifter + */ + vmovups _sInvLn2+__svml_scosh_data_internal(%rip), %zmm10 + vmovups _sLn2hi+__svml_scosh_data_internal(%rip), %zmm7 + vmovups _sLn2lo+__svml_scosh_data_internal(%rip), %zmm9 + +/* */ + vmovups _sPC3_UISA+__svml_scosh_data_internal(%rip), %zmm2 + +/* x^2 */ + vmovups _sPC2_UISA+__svml_scosh_data_internal(%rip), %zmm3 + +/* G1,G2 2^N,2^(-N) */ + vmovups __svml_scosh_data_internal(%rip), %zmm12 + vmovups _sExp_tbl_NH+__svml_scosh_data_internal(%rip), %zmm13 + +/* + * Implementation + * Abs argument + */ + vandnps %zmm0, %zmm4, %zmm1 + +/* Check for overflow/underflow */ + vpternlogd $255, %zmm5, %zmm5, %zmm5 + vfmadd213ps {rn-sae}, %zmm6, %zmm1, %zmm10 + vpcmpd $1, _iDomainRange_UISA+__svml_scosh_data_internal(%rip), %zmm1, %k1 + +/* iM now is an EXP(2^N) */ + 
vpslld $18, %zmm10, %zmm11 + +/* + * R + * sN = sM - RShifter + */ + vsubps {rn-sae}, %zmm6, %zmm10, %zmm8 + vpermt2ps _sExp_tbl_PH+64+__svml_scosh_data_internal(%rip), %zmm10, %zmm12 + vpermt2ps _sExp_tbl_NH+64+__svml_scosh_data_internal(%rip), %zmm10, %zmm13 + vpandnd %zmm1, %zmm1, %zmm5{%k1} + +/* sR = sX - sN*Log2_hi */ + vfnmadd231ps {rn-sae}, %zmm7, %zmm8, %zmm1 + vptestmd %zmm5, %zmm5, %k0 + +/* sR = (sX - sN*Log2_hi) - sN*Log2_lo */ + vfnmadd231ps {rn-sae}, %zmm9, %zmm8, %zmm1 + kmovw %k0, %edx + vmulps {rn-sae}, %zmm1, %zmm1, %zmm4 + vmulps {rn-sae}, %zmm4, %zmm2, %zmm2 + +/* sSinh_r = r + r*(r^2*(a3)) */ + vfmadd213ps {rn-sae}, %zmm1, %zmm1, %zmm2 + +/* sOut = r^2*(a2) */ + vmulps {rn-sae}, %zmm4, %zmm3, %zmm1 + vpandd _iExpMask+__svml_scosh_data_internal(%rip), %zmm11, %zmm14 + vpaddd %zmm14, %zmm12, %zmm15 + vpsubd %zmm14, %zmm13, %zmm10 + +/* sG2 = 2^N*Th + 2^(-N)*T_h */ + vaddps {rn-sae}, %zmm10, %zmm15, %zmm5 + +/* sG1 = 2^N*Th - 2^(-N)*T_h */ + vsubps {rn-sae}, %zmm10, %zmm15, %zmm6 + +/* res = sG1*(r + r*(r^2*(a3))) + sG2*(1+r^2*(a2)) */ + vfmadd213ps {rn-sae}, %zmm5, %zmm5, %zmm1 + vfmadd213ps {rn-sae}, %zmm1, %zmm2, %zmm6 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm6 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm6, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm6, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm6 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 
0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm6 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm6 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, 
%r14d + movss 64(%rsp,%r14,4), %xmm0 + call coshf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_coshf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_scosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _sExp_tbl_PH[32][1]; + __declspec(align(64)) VUINT32 _sExp_tbl_NH[32][1]; + __declspec(align(64)) VUINT32 _sShifter_UISA[16][1]; + __declspec(align(64)) VUINT32 _iDomainRange_UISA[16][1]; + __declspec(align(64)) VUINT32 _sPC1_UISA[16][1]; + __declspec(align(64)) VUINT32 _sPC2_UISA[16][1]; + __declspec(align(64)) VUINT32 _sPC3_UISA[16][1]; + __declspec(align(64)) VUINT32 _sInvLn2[16][1]; + __declspec(align(64)) VUINT32 _sLn2hi[16][1]; + __declspec(align(64)) VUINT32 _sLn2lo[16][1]; + __declspec(align(64)) VUINT32 _sSign[16][1]; + __declspec(align(64)) VUINT32 _iExpMask[16][1]; + __declspec(align(64)) VUINT32 _sShifter[16][1]; + __declspec(align(64)) VUINT32 _iDomainRange[16][1]; + __declspec(align(64)) VUINT32 _sPC1[16][1]; + __declspec(align(64)) VUINT32 _sPC2[16][1]; + __declspec(align(64)) VUINT32 _sPC3[16][1]; +} __svml_scosh_data_internal; +#endif +__svml_scosh_data_internal: + /* _sExp_tbl_PH 2^(i/32-1), i=0..31 */ + .long 0x3f000000, 0x3f02cd87, 0x3f05aac3, 0x3f08980f + .long 0x3f0b95c2, 0x3f0ea43a, 0x3f11c3d3, 0x3f14f4f0 + .long 0x3f1837f0, 0x3f1b8d3a, 0x3f1ef532, 0x3f227043 + .long 0x3f25fed7, 0x3f29a15b, 0x3f2d583f, 0x3f3123f6 + .long 0x3f3504f3, 0x3f38fbaf, 0x3f3d08a4, 0x3f412c4d + .long 0x3f45672a, 0x3f49b9be, 0x3f4e248c, 0x3f52a81e + .long 0x3f5744fd, 0x3f5bfbb8, 0x3f60ccdf, 0x3f65b907 + .long 0x3f6ac0c7, 0x3f6fe4ba, 0x3f75257d, 0x3f7a83b3 + /* _sExp_tbl_NH 2^(-i/32-1), i=0..31 */ + .align 64 + .long 0x3f000000, 0x3efa83b3, 0x3ef5257d, 0x3eefe4ba + .long 0x3eeac0c7, 0x3ee5b907, 0x3ee0ccdf, 0x3edbfbb8 + .long 0x3ed744fd, 0x3ed2a81e, 0x3ece248c, 
0x3ec9b9be + .long 0x3ec5672a, 0x3ec12c4d, 0x3ebd08a4, 0x3eb8fbaf + .long 0x3eb504f3, 0x3eb123f6, 0x3ead583f, 0x3ea9a15b + .long 0x3ea5fed7, 0x3ea27043, 0x3e9ef532, 0x3e9b8d3a + .long 0x3e9837f0, 0x3e94f4f0, 0x3e91c3d3, 0x3e8ea43a + .long 0x3e8b95c2, 0x3e88980f, 0x3e85aac3, 0x3e82cd87 + .align 64 + .long 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000 /* 1.5*2^18 _sShifter_UISA */ + .align 64 + .long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange_UISA */ + .align 64 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1_UISA=1 */ + .align 64 + .long 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f, 0x3f00010f /* _sPC2_UISA */ + .align 64 + .long 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd, 0x3e2aaacd /* _sPC3_UISA */ + .align 64 + .long 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B /* _sInvLn2 */ //k=0 + .align 64 + .long 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000 /* _sLn2hi */ + .align 64 + .long 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 
0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4 /* _sLn2lo */ + .align 64 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSign */ + .align 64 + .long 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000, 0x7f800000 /* _iExpMask */ + .align 64 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 64 + .long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange */ + .align 64 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1=1 */ + .align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sPC2 */ + .align 64 + .long 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57 /* _sPC3 */ + .align 64 + .type __svml_scosh_data_internal,@object + .size __svml_scosh_data_internal,.-__svml_scosh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core-sse2.S new file mode 100644 index 
0000000000..c719dc7d6a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized coshf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_coshf _ZGVbN4v_coshf_sse2 +#include "../svml_s_coshf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core.c new file mode 100644 index 0000000000..c2dfcd44f8 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized coshf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_coshf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_coshf, __GI__ZGVbN4v_coshf, + __redirect__ZGVbN4v_coshf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core_sse4.S new file mode 100644 index 0000000000..506f6a4bd9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf4_core_sse4.S @@ -0,0 +1,305 @@ +/* Function coshf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute cosh(x) as (exp(x)+exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * cosh(NaN) = quiet NaN, and raise invalid exception + * cosh(INF) = that INF + * cosh(0) = 1 + * cosh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_scosh_data_internal + */ +#define _sInvLn2 0 +#define _sLn2hi 16 +#define _sLn2lo 32 +#define _sSign 48 +#define _sShifter 64 +#define _iDomainRange 80 +#define _sPC1 96 +#define _sPC2 112 +#define _sPC3 128 +#define _sPC4 144 +#define _sPC5 160 +#define _sPC6 176 +#define _iHalf 192 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_coshf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* + * Implementation + * Abs argument + */ + movups _sSign+__svml_scosh_data_internal(%rip), %xmm1 + +/* + * Load argument + * dM = x/log(2) + RShifter + */ + movups _sInvLn2+__svml_scosh_data_internal(%rip), %xmm9 + andnps %xmm0, %xmm1 + mulps %xmm1, %xmm9 + +/* Check for overflow/underflow */ + movaps %xmm1, %xmm3 + movups _sShifter+__svml_scosh_data_internal(%rip), %xmm4 + movups _sLn2hi+__svml_scosh_data_internal(%rip), %xmm5 + addps %xmm4, %xmm9 + +/* + * R + * sN = sM - RShifter + */ + movaps %xmm9, %xmm6 + +/* + * G1,G2 2^N,2^(-N) + * iM now is an EXP(2^N) + */ + pslld $23, %xmm9 + movups _sLn2lo+__svml_scosh_data_internal(%rip), %xmm7 + subps %xmm4, %xmm6 + +/* sR = sX - sN*Log2_hi */ + mulps %xmm6, %xmm5 + +/* sR = (sX - sN*Log2_hi) - sN*Log2_lo */ + mulps %xmm6, %xmm7 + movdqu _iDomainRange+__svml_scosh_data_internal(%rip), %xmm2 + pcmpgtd %xmm2, %xmm3 + pcmpeqd %xmm1, %xmm2 + +/* + * sinh(r) = r*((a1=1)+r^2*(a3+r^2*(a5+r^2*a7))) = r + r*(r^2*(a3+r^2*(a5+r^2*a7))) .... 
+ * sSinh_r = (a3+r^2*a5) + */ + movups _sPC5+__svml_scosh_data_internal(%rip), %xmm10 + por %xmm2, %xmm3 + +/* + * sinh(X) = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2)) + * sOut = (a4 +a6*sR2) + */ + movups _sPC6+__svml_scosh_data_internal(%rip), %xmm11 + subps %xmm5, %xmm1 + movmskps %xmm3, %edx + movdqu _iHalf+__svml_scosh_data_internal(%rip), %xmm8 + subps %xmm7, %xmm1 + +/* sR2 = sR^2, shuffled */ + movaps %xmm1, %xmm13 + movdqa %xmm8, %xmm2 + mulps %xmm1, %xmm13 + paddd %xmm9, %xmm2 + mulps %xmm13, %xmm10 + psubd %xmm9, %xmm8 + mulps %xmm13, %xmm11 + addps _sPC3+__svml_scosh_data_internal(%rip), %xmm10 + addps _sPC4+__svml_scosh_data_internal(%rip), %xmm11 + +/* sSinh_r = r^2*(a3+r^2*a5) */ + mulps %xmm13, %xmm10 + +/* sOut = a2+sR2*(a4+a6*sR2) */ + mulps %xmm13, %xmm11 + +/* sSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + mulps %xmm1, %xmm10 + addps _sPC2+__svml_scosh_data_internal(%rip), %xmm11 + addps %xmm10, %xmm1 + +/* sOut = sR2*(a2+sR2*(a4+a6*sR2)) */ + mulps %xmm11, %xmm13 + +/* sG1 = 2^(N-1)-2^(-N-1) */ + movdqa %xmm2, %xmm12 + +/* sG2 = 2^(N-1)+2^(-N-1) */ + addps %xmm8, %xmm2 + subps %xmm8, %xmm12 + +/* sOut = sG2*sR2*(a2+sR2*(a4+a6*sR2)) */ + mulps %xmm2, %xmm13 + +/* sOut = sG1*sinh(dR)+sG2*sR2*(a2+sR2*(a4+a6*sR2)) */ + mulps %xmm1, %xmm12 + addps %xmm12, %xmm13 + +/* sOut = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2)) */ + addps %xmm13, %xmm2 + +/* Ret H */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm2, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm2, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl 
%edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm2 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm2 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call coshf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_coshf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_scosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _sInvLn2[4][1]; + __declspec(align(16)) VUINT32 _sLn2hi[4][1]; + __declspec(align(16)) VUINT32 _sLn2lo[4][1]; + __declspec(align(16)) VUINT32 _sSign[4][1]; + __declspec(align(16)) VUINT32 _sShifter[4][1]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; + __declspec(align(16)) VUINT32 _sPC1[4][1]; + __declspec(align(16)) VUINT32 _sPC2[4][1]; + __declspec(align(16)) VUINT32 _sPC3[4][1]; + __declspec(align(16)) VUINT32 _sPC4[4][1]; + __declspec(align(16)) VUINT32 _sPC5[4][1]; + __declspec(align(16)) VUINT32 _sPC6[4][1]; + __declspec(align(16)) VUINT32 _iHalf[4][1]; +} __svml_scosh_data_internal; +#endif +__svml_scosh_data_internal: + .long 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B /* _sInvLn2 */ //k=0 + .align 16 + .long 0x3F317000, 
0x3F317000, 0x3F317000, 0x3F317000 /* _sLn2hi */ + .align 16 + .long 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4 /* _sLn2lo */ + .align 16 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSign */ + .align 16 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 16 + .long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange */ + .align 16 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1=1 */ + .align 16 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sPC2 */ + .align 16 + .long 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57 /* _sPC3 */ + .align 16 + .long 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72 /* _sPC4 */ + .align 16 + .long 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461 /* _sPC5 */ + .align 16 + .long 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3 /* _sPC6 */ + // Integer constants + .align 16 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _iHalf*/ + .align 16 + .type __svml_scosh_data_internal,@object + .size __svml_scosh_data_internal,.-__svml_scosh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core-sse.S new file mode 100644 index 0000000000..c27229e1fa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized coshf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN8v_coshf _ZGVdN8v_coshf_sse_wrapper +#include "../svml_s_coshf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core.c new file mode 100644 index 0000000000..e82818b2c9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized coshf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN8v_coshf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_coshf, __GI__ZGVdN8v_coshf, + __redirect__ZGVdN8v_coshf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core_avx2.S new file mode 100644 index 0000000000..9149061e7e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_coshf8_core_avx2.S @@ -0,0 +1,308 @@ +/* Function coshf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute cosh(x) as (exp(x)+exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * cosh(NaN) = quiet NaN, and raise invalid exception + * cosh(INF) = that INF + * cosh(0) = 1 + * cosh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_scosh_data_internal + */ +#define _sInvLn2 0 +#define _sLn2hi 32 +#define _sLn2lo 64 +#define _sSign 96 +#define _sShifter 128 +#define _iDomainRange 160 +#define _sPC1 192 +#define _sPC2 224 +#define _sPC3 256 +#define _sPC4 288 +#define _sPC5 320 +#define _sPC6 352 +#define _iHalf 384 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_coshf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovups _sSign+__svml_scosh_data_internal(%rip), %ymm2 + vmovups _sShifter+__svml_scosh_data_internal(%rip), %ymm7 + +/* + * Load argument + * dM = x/log(2) + RShifter + */ + vmovups _sInvLn2+__svml_scosh_data_internal(%rip), %ymm10 + vmovups _sLn2hi+__svml_scosh_data_internal(%rip), %ymm8 + vmovups _iDomainRange+__svml_scosh_data_internal(%rip), %ymm3 + +/* + * sinh(r) =
r*((a1=1) + r^2*(a3 + r^2*(a5 + r^2*a7))) = r + r*(r^2*(a3 + r^2*(a5 + r^2*a7))) + * sSinh_r = (a3+r^2*a5) + */ + vmovups _sPC5+__svml_scosh_data_internal(%rip), %ymm15 + vmovups _iHalf+__svml_scosh_data_internal(%rip), %ymm11 + vmovaps %ymm0, %ymm1 + +/* + * Implementation + * Abs argument + */ + vandnps %ymm1, %ymm2, %ymm0 + vfmadd213ps %ymm7, %ymm0, %ymm10 + +/* + * R + * sN = sM - RShifter + */ + vsubps %ymm7, %ymm10, %ymm9 + +/* + * G1,G2 2^N,2^(-N) + * iM now is an EXP(2^N) + */ + vpslld $23, %ymm10, %ymm12 + +/* Check for overflow/underflow */ + vpcmpgtd %ymm3, %ymm0, %ymm4 + vpcmpeqd %ymm3, %ymm0, %ymm5 + +/* sR = sX - sN*Log2_hi */ + vfnmadd231ps %ymm8, %ymm9, %ymm0 + vpaddd %ymm12, %ymm11, %ymm13 + vpsubd %ymm12, %ymm11, %ymm14 + vpor %ymm5, %ymm4, %ymm6 + +/* sR = (sX - sN*Log2_hi) - sN*Log2_lo */ + vfnmadd231ps _sLn2lo+__svml_scosh_data_internal(%rip), %ymm9, %ymm0 + +/* sG1 = 2^(N-1)-2^(-N-1) */ + vsubps %ymm14, %ymm13, %ymm4 + +/* sG2 = 2^(N-1)+2^(-N-1) */ + vaddps %ymm14, %ymm13, %ymm3 + +/* sR2 = sR^2, shuffled */ + vmulps %ymm0, %ymm0, %ymm2 + vfmadd213ps _sPC3+__svml_scosh_data_internal(%rip), %ymm2, %ymm15 + +/* sSinh_r = r^2*(a3+r^2*a5) */ + vmulps %ymm15, %ymm2, %ymm13 + +/* sSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + vfmadd213ps %ymm0, %ymm0, %ymm13 + +/* + * sinh(X) = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2)) + * sOut = (a4 + a6*sR2) + */ + vmovups _sPC6+__svml_scosh_data_internal(%rip), %ymm0 + vfmadd213ps _sPC4+__svml_scosh_data_internal(%rip), %ymm2, %ymm0 + +/* sOut = a2+sR2*(a4+a6*sR2) */ + vfmadd213ps _sPC2+__svml_scosh_data_internal(%rip), %ymm2, %ymm0 + +/* sOut = sR2*(a2+sR2*(a4+a6*sR2)) */ + vmulps %ymm0, %ymm2, %ymm15 + +/* sOut = sG2*sR2*(a2+sR2*(a4+a6*sR2)) */ + vmulps %ymm15, %ymm3, %ymm14 + +/* sOut = sG1*sinh(dR)+sG2*sR2*(a2+sR2*(a4+a6*sR2)) */ + vfmadd213ps %ymm14, %ymm13, %ymm4 + vmovmskps %ymm6, %edx + +/* sOut = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2)) */ + vaddps %ymm4, %ymm3, %ymm0 + +/* Ret H */ + testl %edx, %edx
+ +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm1, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: 
-32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call coshf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_coshf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_scosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _sInvLn2[8][1]; + __declspec(align(32)) VUINT32 _sLn2hi[8][1]; + __declspec(align(32)) VUINT32 _sLn2lo[8][1]; + __declspec(align(32)) VUINT32 _sSign[8][1]; + __declspec(align(32)) VUINT32 _sShifter[8][1]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; + __declspec(align(32)) VUINT32 _sPC1[8][1]; + __declspec(align(32)) VUINT32 _sPC2[8][1]; + __declspec(align(32)) VUINT32 _sPC3[8][1]; + __declspec(align(32)) VUINT32 _sPC4[8][1]; + __declspec(align(32)) VUINT32 _sPC5[8][1]; + __declspec(align(32)) VUINT32 _sPC6[8][1]; + __declspec(align(32)) VUINT32 _iHalf[8][1]; +} __svml_scosh_data_internal; +#endif +__svml_scosh_data_internal: + .long 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B /* _sInvLn2 */ //k=0 + .align 32 + .long 0x3F317000,
0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000 /* _sLn2hi */ + .align 32 + .long 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4 /* _sLn2lo */ + .align 32 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSign */ + .align 32 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 32 + .long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange */ + .align 32 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1=1 */ + .align 32 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sPC2 */ + .align 32 + .long 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57 /* _sPC3 */ + .align 32 + .long 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72 /* _sPC4 */ + .align 32 + .long 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461 /* _sPC5 */ + .align 32 + .long 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3 /* _sPC6 */ + // Integer constants + .align 32 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _iHalf*/ + .align 32 + .type __svml_scosh_data_internal,@object + .size __svml_scosh_data_internal,.-__svml_scosh_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_cosh2_core.S b/sysdeps/x86_64/fpu/svml_d_cosh2_core.S new file mode 100644 index 0000000000..f95952cfe5 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cosh2_core.S @@ -0,0 +1,29 @@ +/* Function cosh vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_cosh) +WRAPPER_IMPL_SSE2 cosh +END (_ZGVbN2v_cosh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_cosh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_cosh4_core.S b/sysdeps/x86_64/fpu/svml_d_cosh4_core.S new file mode 100644 index 0000000000..cc24d0fb6b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cosh4_core.S @@ -0,0 +1,29 @@ +/* Function cosh vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_cosh) +WRAPPER_IMPL_AVX _ZGVbN2v_cosh +END (_ZGVdN4v_cosh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_cosh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_cosh4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_cosh4_core_avx.S new file mode 100644 index 0000000000..4323f5e308 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cosh4_core_avx.S @@ -0,0 +1,25 @@ +/* Function cosh vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_cosh) +WRAPPER_IMPL_AVX _ZGVbN2v_cosh +END (_ZGVcN4v_cosh) diff --git a/sysdeps/x86_64/fpu/svml_d_cosh8_core.S b/sysdeps/x86_64/fpu/svml_d_cosh8_core.S new file mode 100644 index 0000000000..90ee1ca125 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cosh8_core.S @@ -0,0 +1,25 @@ +/* Function cosh vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_cosh) +WRAPPER_IMPL_AVX512 _ZGVdN4v_cosh +END (_ZGVeN8v_cosh) diff --git a/sysdeps/x86_64/fpu/svml_s_coshf16_core.S b/sysdeps/x86_64/fpu/svml_s_coshf16_core.S new file mode 100644 index 0000000000..fe243b8b94 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_coshf16_core.S @@ -0,0 +1,25 @@ +/* Function coshf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_coshf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_coshf +END (_ZGVeN16v_coshf) diff --git a/sysdeps/x86_64/fpu/svml_s_coshf4_core.S b/sysdeps/x86_64/fpu/svml_s_coshf4_core.S new file mode 100644 index 0000000000..b55ede6e38 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_coshf4_core.S @@ -0,0 +1,29 @@ +/* Function coshf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_coshf) +WRAPPER_IMPL_SSE2 coshf +END (_ZGVbN4v_coshf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_coshf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_coshf8_core.S b/sysdeps/x86_64/fpu/svml_s_coshf8_core.S new file mode 100644 index 0000000000..3ea02d0f19 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_coshf8_core.S @@ -0,0 +1,29 @@ +/* Function coshf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_coshf) +WRAPPER_IMPL_AVX _ZGVbN4v_coshf +END (_ZGVdN8v_coshf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_coshf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_coshf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_coshf8_core_avx.S new file mode 100644 index 0000000000..9b3002f7c9 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_coshf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function coshf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_coshf) +WRAPPER_IMPL_AVX _ZGVbN4v_coshf +END (_ZGVcN8v_coshf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx.c new file mode 100644 index 0000000000..1dd311a562 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-cosh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx2.c new file mode 100644 index 0000000000..1dd311a562 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-cosh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx512f.c new file mode 100644 index 0000000000..1dd311a562 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cosh-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-cosh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cosh.c b/sysdeps/x86_64/fpu/test-double-libmvec-cosh.c new file mode 100644 index 0000000000..cf49ec5d87 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cosh.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC cosh +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 256e8f07c9..68c449e04a 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVbN2v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVbN2vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVbN2v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVbN2v_exp10) +VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVbN2v_cosh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 9de1dab2c2..df67306373 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVdN4v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVdN4vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVdN4v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVdN4v_exp10) +VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVdN4v_cosh) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 43865ab099..1a6731098f 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVcN4v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVcN4vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVcN4v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVcN4v_exp10) +VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVcN4v_cosh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 5dbdacf617..4cdfa918e8 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asin), _ZGVeN8v_asin) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVeN8vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVeN8v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVeN8v_exp10) +VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVeN8v_cosh) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx.c new file mode 100644 index 0000000000..905dc3ca4a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-coshf.c" diff 
--git a/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx2.c new file mode 100644 index 0000000000..905dc3ca4a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-coshf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx512f.c new file mode 100644 index 0000000000..905dc3ca4a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-coshf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-coshf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-coshf.c b/sysdeps/x86_64/fpu/test-float-libmvec-coshf.c new file mode 100644 index 0000000000..94b899076b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-coshf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC coshf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index c159c8f583..47a9862233 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVeN16v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVeN16vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVeN16v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVeN16v_exp10f) +VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVeN16v_coshf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index c745ef744a..e7c5410e7b 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVbN4v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVbN4vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVbN4v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVbN4v_exp10f) 
+VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVbN4v_coshf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index c9226cf4dc..b8e9d48cd6 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVdN8v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVdN8vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVdN8v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVdN8v_exp10f) +VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVdN8v_coshf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 92970c5ace..328c827b27 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -33,6 +33,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (asinf), _ZGVcN8v_asinf) VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVcN8vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVcN8v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVcN8v_exp10f) +VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVcN8v_coshf) #define VEC_INT_TYPE __m128i
From patchwork Tue Dec 28 20:11:19 2021 X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573813 To: libc-alpha@sourceware.org Subject: [PATCH v4 07/18] x86-64: Add vector expm1/expm1f implementation to libmvec Date: Tue, 28 Dec 2021 12:11:19 -0800 Message-Id: <20211228201130.737370-8-skpgkp2@gmail.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> From: Sunil Pandey Reply-To: Sunil K Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Implement vectorized expm1/expm1f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector expm1/expm1f with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_expm12_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_expm12_core.c | 27 ++ .../fpu/multiarch/svml_d_expm12_core_sse4.S | 421 ++++++++++++++++++ .../fpu/multiarch/svml_d_expm14_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_expm14_core.c | 27 ++ .../fpu/multiarch/svml_d_expm14_core_avx2.S | 408 +++++++++++++++++ .../fpu/multiarch/svml_d_expm18_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_expm18_core.c | 27 ++ .../fpu/multiarch/svml_d_expm18_core_avx512.S | 334 ++++++++++++++ .../fpu/multiarch/svml_s_expm1f16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_expm1f16_core.c | 28 ++ .../multiarch/svml_s_expm1f16_core_avx512.S | 281 ++++++++++++ .../fpu/multiarch/svml_s_expm1f4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_expm1f4_core.c | 28 ++ .../fpu/multiarch/svml_s_expm1f4_core_sse4.S | 358 +++++++++++++++ .../fpu/multiarch/svml_s_expm1f8_core-sse.S | 20 + .../fpu/multiarch/svml_s_expm1f8_core.c | 28 ++ .../fpu/multiarch/svml_s_expm1f8_core_avx2.S | 351 +++++++++++++++ sysdeps/x86_64/fpu/svml_d_expm12_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_expm14_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_expm14_core_avx.S | 25 ++ sysdeps/x86_64/fpu/svml_d_expm18_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_expm1f16_core.S | 25 ++ sysdeps/x86_64/fpu/svml_s_expm1f4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_expm1f8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_expm1f8_core_avx.S | 25 ++ .../fpu/test-double-libmvec-expm1-avx.c | 1 + .../fpu/test-double-libmvec-expm1-avx2.c | 1 + .../fpu/test-double-libmvec-expm1-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-expm1.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-expm1f-avx.c | 1 + .../fpu/test-float-libmvec-expm1f-avx2.c | 1 + .../fpu/test-float-libmvec-expm1f-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-expm1f.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2725 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_expm12_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_expm14_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_expm14_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_expm18_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_expm1f16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_expm1f4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_expm1f8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_expm1f8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-expm1.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-expm1f.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 35c6ac57a8..28dc4a82c5 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -175,4 +175,15 @@ #define __DECL_SIMD_coshf32x #define __DECL_SIMD_coshf64x #define __DECL_SIMD_coshf128x + +#define __DECL_SIMD_expm1 +#define __DECL_SIMD_expm1f +#define __DECL_SIMD_expm1l +#define __DECL_SIMD_expm1f16 +#define __DECL_SIMD_expm1f32 +#define __DECL_SIMD_expm1f64 +#define __DECL_SIMD_expm1f128 +#define __DECL_SIMD_expm1f32x +#define __DECL_SIMD_expm1f64x +#define __DECL_SIMD_expm1f128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 60a314f69e..c57adc8ace 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -116,7 +116,7 @@ __MATHCALL_VEC (exp10,, (_Mdouble_ __x)); #if defined __USE_XOPEN_EXTENDED || defined __USE_ISOC99 /* Return exp(X) - 1. */ -__MATHCALL (expm1,, (_Mdouble_ __x)); +__MATHCALL_VEC (expm1,, (_Mdouble_ __x)); /* Return log(1 + X). 
*/ __MATHCALL (log1p,, (_Mdouble_ __x)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 4907680143..c9d3213bd3 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -52,6 +52,7 @@ GLIBC_2.35 _ZGVbN2v_atan F GLIBC_2.35 _ZGVbN2v_cosh F GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F +GLIBC_2.35 _ZGVbN2v_expm1 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F @@ -59,6 +60,7 @@ GLIBC_2.35 _ZGVbN4v_atanf F GLIBC_2.35 _ZGVbN4v_coshf F GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F +GLIBC_2.35 _ZGVbN4v_expm1f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F @@ -66,6 +68,7 @@ GLIBC_2.35 _ZGVcN4v_atan F GLIBC_2.35 _ZGVcN4v_cosh F GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F +GLIBC_2.35 _ZGVcN4v_expm1 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F @@ -73,6 +76,7 @@ GLIBC_2.35 _ZGVcN8v_atanf F GLIBC_2.35 _ZGVcN8v_coshf F GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F +GLIBC_2.35 _ZGVcN8v_expm1f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F @@ -80,6 +84,7 @@ GLIBC_2.35 _ZGVdN4v_atan F GLIBC_2.35 _ZGVdN4v_cosh F GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F +GLIBC_2.35 _ZGVdN4v_expm1 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F @@ -87,6 +92,7 @@ GLIBC_2.35 _ZGVdN8v_atanf F GLIBC_2.35 _ZGVdN8v_coshf F GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F +GLIBC_2.35 _ZGVdN8v_expm1f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F @@ -94,6 +100,7 @@ GLIBC_2.35 _ZGVeN16v_atanf F GLIBC_2.35 _ZGVeN16v_coshf F GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F +GLIBC_2.35 _ZGVeN16v_expm1f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 
_ZGVeN8v_asin F @@ -101,4 +108,5 @@ GLIBC_2.35 _ZGVeN8v_atan F GLIBC_2.35 _ZGVeN8v_cosh F GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F +GLIBC_2.35 _ZGVeN8v_expm1 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 708e81b3d0..e2f98e176f 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -86,6 +86,10 @@ # define __DECL_SIMD_cosh __DECL_SIMD_x86_64 # undef __DECL_SIMD_coshf # define __DECL_SIMD_coshf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_expm1 +# define __DECL_SIMD_expm1 __DECL_SIMD_x86_64 +# undef __DECL_SIMD_expm1f +# define __DECL_SIMD_expm1f __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 81d0238ebf..43233059f6 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -42,6 +42,8 @@ !GCC$ builtin (exp10f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cosh) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (coshf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (expm1) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (expm1f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -69,3 +71,5 @@ !GCC$ builtin (exp10f) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosh) attributes simd (notinbranch) if('x32') !GCC$ builtin (coshf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (expm1) attributes simd (notinbranch) if('x32') +!GCC$ builtin (expm1f) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 5bc2df134f..8de8214971 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -30,6 +30,7 @@ libmvec-funcs = \ exp \ exp10 \ exp2 \ + 
expm1 \ hypot \ log \ pow \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 53346d16a2..58debb2dbe 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -20,6 +20,7 @@ libmvec { _ZGVbN2v_cosh; _ZGVcN4v_cosh; _ZGVdN4v_cosh; _ZGVeN8v_cosh; _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; + _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; @@ -27,6 +28,7 @@ libmvec { _ZGVbN4v_coshf; _ZGVcN8v_coshf; _ZGVdN8v_coshf; _ZGVeN16v_coshf; _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; + _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index ac70f15208..f05ece8c8a 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1395,6 +1395,26 @@ float: 1 float128: 3 ldouble: 4 +Function: "expm1_vlen16": +float: 1 + +Function: "expm1_vlen2": +double: 1 + +Function: "expm1_vlen4": +double: 1 +float: 1 + +Function: "expm1_vlen4_avx2": +double: 1 + +Function: "expm1_vlen8": +double: 1 +float: 1 + +Function: "expm1_vlen8_avx2": +float: 1 + Function: "gamma": double: 4 float: 7 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core-sse2.S new file mode 100644 index 0000000000..e8cb6faaca --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized expm1, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVbN2v_expm1 _ZGVbN2v_expm1_sse2 +#include "../svml_d_expm12_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core.c new file mode 100644 index 0000000000..9c794e932e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized expm1, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVbN2v_expm1 +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_expm1, __GI__ZGVbN2v_expm1, __redirect__ZGVbN2v_expm1) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core_sse4.S new file mode 100644 index 0000000000..db763e3856 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm12_core_sse4.S @@ -0,0 +1,421 @@ +/* Function expm1 vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * N = (int)(x*2^k/log(2.0)), R = x - N*log(2)/2^k + * exp(x) = 2^(N/2^k) * poly(R) is computed in high-low parts + * expm1(x) = exp(x)-1 is then obtained via multi-precision computation + * + * + */ + +/* Offsets for data table __svml_dexpm1_data_internal + */ +#define Expm1_HA_table 0 +#define poly_coeff 2048 +#define Log2e 2112 +#define L2H 2128 +#define L2L 2144 +#define ExpAddConst 2160 +#define IndexMask 2176 +#define ExpMask 2192 +#define MOne 2208 +#define AbsMask 2224 +#define Threshold 2240 +#define L2 2256 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_expm1_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + movaps %xmm0, %xmm2 + movups Log2e+__svml_dexpm1_data_internal(%rip), %xmm7 + lea __svml_dexpm1_data_internal(%rip), %rsi + mulpd %xmm0, %xmm7 + movups .FLT_10(%rip), %xmm3 + addpd %xmm3, %xmm7 + subpd %xmm3, %xmm7 + +/* argument reduction */ + movups L2H+__svml_dexpm1_data_internal(%rip), %xmm4 + mulpd %xmm7, %xmm4 + movups L2L+__svml_dexpm1_data_internal(%rip), %xmm5 + mulpd %xmm7, %xmm5 + subpd %xmm4, %xmm2 + subpd %xmm5, %xmm2 + +/* polynomial */ + movups poly_coeff+__svml_dexpm1_data_internal(%rip), %xmm12 + movaps %xmm2, %xmm14 + mulpd %xmm2, %xmm12 + mulpd %xmm2, %xmm14 + addpd poly_coeff+16+__svml_dexpm1_data_internal(%rip), %xmm12 + movups ExpAddConst+__svml_dexpm1_data_internal(%rip), %xmm15 + addpd %xmm7, %xmm15 + mulpd %xmm14, %xmm12 + movups poly_coeff+32+__svml_dexpm1_data_internal(%rip), %xmm13 + mulpd %xmm2, %xmm13 + +/* table lookup */ + movdqu IndexMask+__svml_dexpm1_data_internal(%rip), %xmm8 + pand %xmm15, %xmm8 + movups AbsMask+__svml_dexpm1_data_internal(%rip), %xmm1 + pshufd $2, %xmm8, %xmm9 + movaps %xmm1, %xmm6 + movd %xmm8, %eax + andps %xmm0, %xmm6 + movd %xmm9, %ecx + andnps %xmm0, %xmm1 + movdqu ExpMask+__svml_dexpm1_data_internal(%rip), %xmm11 + pand %xmm11, %xmm15
+ cmpnlepd Threshold+__svml_dexpm1_data_internal(%rip), %xmm6 + addpd poly_coeff+48+__svml_dexpm1_data_internal(%rip), %xmm13 + movmskpd %xmm6, %edx + psllq $41, %xmm15 + +/* T-1 */ + movups MOne+__svml_dexpm1_data_internal(%rip), %xmm4 + movslq %eax, %rax + movslq %ecx, %rcx + addpd %xmm12, %xmm13 + movups (%rsi,%rax), %xmm3 + movups (%rsi,%rcx), %xmm10 + movaps %xmm3, %xmm6 + unpckhpd %xmm10, %xmm3 + +/* Th1 = (Th-1) + Tl */ + mulpd %xmm15, %xmm3 + mulpd %xmm13, %xmm14 + unpcklpd %xmm10, %xmm6 + orps %xmm15, %xmm6 + addpd %xmm4, %xmm6 + addpd %xmm14, %xmm2 + addpd %xmm3, %xmm6 + +/* T = Th+Tl */ + movaps %xmm6, %xmm5 + subpd %xmm4, %xmm5 + mulpd %xmm5, %xmm2 + addpd %xmm2, %xmm6 + orps %xmm1, %xmm6 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm6 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm6, %xmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm6, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 
0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm6 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm6 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call expm1@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_expm1_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dexpm1_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Expm1_HA_table[(1<<8)][2]; + __declspec(align(16)) VUINT32 poly_coeff[4][2][2]; + __declspec(align(16)) VUINT32
Log2e[2][2]; + __declspec(align(16)) VUINT32 L2H[2][2]; + __declspec(align(16)) VUINT32 L2L[2][2]; + __declspec(align(16)) VUINT32 ExpAddConst[2][2]; + __declspec(align(16)) VUINT32 IndexMask[2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 MOne[2][2]; + __declspec(align(16)) VUINT32 AbsMask[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 L2[2][2]; +} __svml_dexpm1_data_internal; +#endif +__svml_dexpm1_data_internal: + /* Expm1_HA_table */ + .quad 0x0000000000000000, 0x0000000000000000 + .quad 0x0000163da8000000, 0x3e3fb33356d84a67 + .quad 0x00002c9a40000000, 0xbe3887f9f1190835 + .quad 0x00004315e8000000, 0x3e1b9fe12f5ce3e7 + .quad 0x000059b0d0000000, 0x3e48ac2ba1d73e2a + .quad 0x0000706b28000000, 0x3e3ddf6ddc6dc404 + .quad 0x0000874518000000, 0x3e1d66f20230d7c9 + .quad 0x00009e3ec8000000, 0x3e46379c1a290f03 + .quad 0x0000b55870000000, 0xbe4833b784eb3a37 + .quad 0x0000cc9228000000, 0x3e4b923fba03db83 + .quad 0x0000e3ec30000000, 0x3e469e8d10103a17 + .quad 0x0000fb66b0000000, 0xbdb2ce50dcdf6e22 + .quad 0x00011301d0000000, 0x3df25b50a4ebbf1b + .quad 0x00012abdc0000000, 0x3e1b0c72fee4aeb5 + .quad 0x0001429ab0000000, 0xbe356d2204cbefe7 + .quad 0x00015a98c8000000, 0x3e24b1ca24901aae + .quad 0x000172b840000000, 0xbe4c15742919041c + .quad 0x00018af938000000, 0x3e2191bd3777ee17 + .quad 0x0001a35be8000000, 0x3e4b7e5ba9e5b4c8 + .quad 0x0001bbe088000000, 0xbe4fdd19632a70c7 + .quad 0x0001d48730000000, 0x3e368b9aa7805b80 + .quad 0x0001ed5020000000, 0x3e47e6c8e5c40d00 + .quad 0x0002063b88000000, 0x3e18a3358ee3bac1 + .quad 0x00021f4990000000, 0x3e37ddc962552fd3 + .quad 0x0002387a70000000, 0xbe38a9dc7993e052 + .quad 0x000251ce50000000, 0xbe135670329f5521 + .quad 0x00026b4568000000, 0xbe40ec1916d42cc6 + .quad 0x000284dfe0000000, 0x3e3f5638096cf15d + .quad 0x00029e9df8000000, 0xbe470108f69ed175 + .quad 0x0002b87fd0000000, 0x3e2b5b31ffbbd48d + .quad 0x0002d285a8000000, 0xbe31bfcf4bff6e2b + .quad 
0x0002ecafa8000000, 0x3e33e2f5611ca0f4 + .quad 0x000306fe08000000, 0x3e418db8a96f46ad + .quad 0x0003217100000000, 0xbe4d993e76563187 + .quad 0x00033c08b0000000, 0x3e4320b7fa64e431 + .quad 0x000356c560000000, 0xbe1b5803cdae772e + .quad 0x000371a738000000, 0xbe28aac6ab1d7560 + .quad 0x00038cae70000000, 0xbe47d13cd3d2b1a8 + .quad 0x0003a7db38000000, 0xbe48d30048af21b7 + .quad 0x0003c32dc0000000, 0x3e489d47242000f9 + .quad 0x0003dea650000000, 0xbe4f6e5eee525f6f + .quad 0x0003fa4508000000, 0xbe4a9bff22fa047f + .quad 0x0004160a20000000, 0x3e3f72e29f84325c + .quad 0x000431f5d8000000, 0x3e350a896dc70444 + .quad 0x00044e0860000000, 0x3e18624b40c4dbd0 + .quad 0x00046a41f0000000, 0xbe4717fd446d7686 + .quad 0x000486a2b8000000, 0xbe41f6197f61f2e2 + .quad 0x0004a32af0000000, 0x3e2afa7bcce5b17a + .quad 0x0004bfdad8000000, 0xbe464eaec715e343 + .quad 0x0004dcb298000000, 0x3e3fddd0d63b36ef + .quad 0x0004f9b278000000, 0xbe362d35952cc275 + .quad 0x000516daa0000000, 0x3e467b320e0897a9 + .quad 0x0005342b58000000, 0xbe362b07e20f57c4 + .quad 0x000551a4c8000000, 0x3e42ec9076297631 + .quad 0x00056f4738000000, 0xbe34ad8259913500 + .quad 0x00058d12d8000000, 0xbe4b41c016d6a1ea + .quad 0x0005ab07e0000000, 0xbe45bd5eb539b67f + .quad 0x0005c92688000000, 0x3e42ca35b80e258e + .quad 0x0005e76f18000000, 0xbe4296f5bc8b20da + .quad 0x000605e1b8000000, 0x3e376dc08b076f59 + .quad 0x0006247eb0000000, 0x3e0d2ac258f87d03 + .quad 0x0006434638000000, 0xbe4999e701c483c7 + .quad 0x0006623880000000, 0x3e42a91124893ecf + .quad 0x00068155d8000000, 0xbe4d9ab467bf1d47 + .quad 0x0006a09e68000000, 0xbe380c4336f74d05 + .quad 0x0006c01278000000, 0xbe47a12a08944ab3 + .quad 0x0006dfb240000000, 0xbe4cd72e886ef8ea + .quad 0x0006ff7df8000000, 0x3e3519483cf87e1b + .quad 0x00071f75e8000000, 0x3e2d8bee7ba46e1e + .quad 0x00073f9a48000000, 0x3e24b02e77ab934a + .quad 0x00075feb58000000, 0xbe3bd98374091656 + .quad 0x0007806950000000, 0xbe00d1604f328fec + .quad 0x0007a11470000000, 0x3e4f580c36bea881 + .quad 0x0007c1ed00000000, 
0x3e330c1327c49334 + .quad 0x0007e2f338000000, 0xbe330b19defa2fd4 + .quad 0x0008042758000000, 0xbe4e0f2f724f90cc + .quad 0x0008258998000000, 0x3e34cce128acf88b + .quad 0x0008471a48000000, 0xbe3dc385331ad094 + .quad 0x000868d998000000, 0x3e4a2497640720ed + .quad 0x00088ac7d8000000, 0x3e38a669966530bd + .quad 0x0008ace540000000, 0x3e415506dadd3e2b + .quad 0x0008cf3218000000, 0xbe34abb7410d55e3 + .quad 0x0008f1ae98000000, 0x3e31577362b98274 + .quad 0x0009145b08000000, 0x3e4c8ffe2c4530da + .quad 0x00093737b0000000, 0x3e29b8bc9e8a0388 + .quad 0x00095a44c8000000, 0x3e4e4290774da41b + .quad 0x00097d82a0000000, 0xbe00d8d83a30b6f8 + .quad 0x0009a0f170000000, 0x3e2940f737462137 + .quad 0x0009c49180000000, 0x3e451f8480e3e236 + .quad 0x0009e86318000000, 0x3e3e323231824ca8 + .quad 0x000a0c6678000000, 0x3e4aef2b2594d6d4 + .quad 0x000a309bf0000000, 0xbe4dae966539f470 + .quad 0x000a5503b0000000, 0x3e41f12ae45a1225 + .quad 0x000a799e10000000, 0x3e49859ac3796fd9 + .quad 0x000a9e6b58000000, 0xbe44301205e0a6de + .quad 0x000ac36bc0000000, 0xbe0606431f9234cb + .quad 0x000ae89f98000000, 0x3e35ad3ad5e8734d + .quad 0x000b0e0728000000, 0x3e38db66590842ad + .quad 0x000b33a2b8000000, 0x3e13c57ebdaff43a + .quad 0x000b597290000000, 0xbe40d536338e3bf7 + .quad 0x000b7f76f0000000, 0x3e47daf237553d84 + .quad 0x000ba5b030000000, 0x3e2420c930819679 + .quad 0x000bcc1e90000000, 0x3e12f074891ee83d + .quad 0x000bf2c258000000, 0x3e4eb8f0442046b8 + .quad 0x000c199be0000000, 0xbe43d56b1eeef9a7 + .quad 0x000c40ab60000000, 0xbd87c2c975903ef8 + .quad 0x000c67f130000000, 0xbe3a82eb4b5dec80 + .quad 0x000c8f6d98000000, 0xbe4fc8c257729a1e + .quad 0x000cb720e0000000, 0xbe48837cb757e1a1 + .quad 0x000cdf0b58000000, 0xbe4511e031dd83b5 + .quad 0x000d072d48000000, 0x3e403c4bdc687918 + .quad 0x000d2f8708000000, 0x3deb13e315bc2473 + .quad 0x000d5818e0000000, 0xbe4822dbc6d12fd3 + .quad 0x000d80e318000000, 0xbe3367c68447b063 + .quad 0x000da9e600000000, 0x3e4ed9942b84600d + .quad 0x000dd321f0000000, 0x3e480da3025b4aef + 
.quad 0x000dfc9730000000, 0x3e4bdcdaf5cb4656 + .quad 0x000e264618000000, 0xbe4852f6baf6c4f0 + .quad 0x000e502ee8000000, 0xbe1d30027630bb40 + .quad 0x000e7a51f8000000, 0x3e4e3a641a5aa459 + .quad 0x000ea4afa0000000, 0x3e452486cc2c7b9d + .quad 0x000ecf4830000000, 0xbe438cc07b927e77 + .quad 0x000efa1bf0000000, 0xbe39ea5d888e02de + .quad 0x000f252b38000000, 0xbe2288ad162f2d20 + .quad 0x000f507658000000, 0x3e4b722a033a7c26 + .quad 0x000f7bfdb0000000, 0xbe431a0f63b7625a + .quad 0x000fa7c180000000, 0x3e39e90d82e90a7e + .quad 0x000fd3c228000000, 0x3e4c7b8f884badd2 + /*== poly_coeff[4] ==*/ + .align 16 + .quad 0x3f81111168877F38, 0x3f81111168877F38 /* coeff5 */ + .quad 0x3fa55555C2A9C0F3, 0x3fa55555C2A9C0F3 /* coeff4 */ + .quad 0x3fc555555555541D, 0x3fc555555555541D /* coeff3 */ + .quad 0x3fdFFFFFFFFFFE5C, 0x3fdFFFFFFFFFFE5C /* coeff2 */ + /*== Log2e ==*/ + .align 16 + .quad 0x40671547652B82FE, 0x40671547652B82FE + /*== L2H ==*/ + .align 16 + .quad 0x3f762e42fef80000, 0x3f762e42fef80000 + /*== L2L ==*/ + .align 16 + .quad 0x3d41cf79abc9e3b4, 0x3d41cf79abc9e3b4 + /*== ExpAddConst ==*/ + .align 16 + .quad 0x42f80000001ff800, 0x42f80000001ff800 + /*== IndexMask ==*/ + .align 16 + .quad 0x00000000000007f0, 0x00000000000007f0 + /*== ExpMask ==*/ + .align 16 + .quad 0x00000000003ff800, 0x00000000003ff800 + /*== MOne ==*/ + .align 16 + .quad 0xbff0000000000000, 0xbff0000000000000 + /*== AbsMask ==*/ + .align 16 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff + /*== Threshold ==*/ + .align 16 + .quad 0x40861DA04CBAFE43, 0x40861DA04CBAFE43 + /*== L2 ==*/ + .align 16 + .quad 0x3f762e42fefa39ef, 0x3f762e42fefa39ef + .align 16 + .type __svml_dexpm1_data_internal,@object + .size __svml_dexpm1_data_internal,.-__svml_dexpm1_data_internal + .align 16 + +.FLT_10: + .long 0x00000000,0x43380000,0x00000000,0x43380000 + .type .FLT_10,@object + .size .FLT_10,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core-sse.S new file mode 
100644 index 0000000000..e7016708d0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized expm1, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVdN4v_expm1 _ZGVdN4v_expm1_sse_wrapper +#include "../svml_d_expm14_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core.c new file mode 100644 index 0000000000..4215d7dbaf --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized expm1, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details.
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVdN4v_expm1 +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_expm1, __GI__ZGVdN4v_expm1, __redirect__ZGVdN4v_expm1) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core_avx2.S new file mode 100644 index 0000000000..c34f73a578 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm14_core_avx2.S @@ -0,0 +1,408 @@ +/* Function expm1 vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * N = (int)(x*2^k/log(2.0)), R = x - N*log(2)/2^k + * exp(x) = 2^(N/2^k) * poly(R) is computed in high-low parts + * expm1(x) = exp(x)-1 is then obtained via multi-precision computation + * + * + */ + +/* Offsets for data table __svml_dexpm1_data_internal + */ +#define Expm1_HA_table 0 +#define poly_coeff 2048 +#define Log2e 2176 +#define L2H 2208 +#define L2L 2240 +#define ExpAddConst 2272 +#define IndexMask 2304 +#define ExpMask 2336 +#define MOne 2368 +#define AbsMask 2400 +#define Threshold 2432 +#define L2 2464 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_expm1_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea __svml_dexpm1_data_internal(%rip), %r8 + vmovapd %ymm0, %ymm3 + vmulpd Log2e+__svml_dexpm1_data_internal(%rip), %ymm3, %ymm4 + +/* argument reduction */ + vmovupd L2H+__svml_dexpm1_data_internal(%rip), %ymm2 + vmovupd AbsMask+__svml_dexpm1_data_internal(%rip), %ymm5 + vroundpd $0, %ymm4, %ymm8 + vaddpd ExpAddConst+__svml_dexpm1_data_internal(%rip), %ymm8, %ymm0 + vfnmadd213pd %ymm3, %ymm8, %ymm2 + +/* table lookup */ + vandps IndexMask+__svml_dexpm1_data_internal(%rip), %ymm0, %ymm9 + vandpd %ymm5, %ymm3, %ymm6 + vcmpnle_uqpd Threshold+__svml_dexpm1_data_internal(%rip), %ymm6, %ymm7 + vfnmadd231pd L2L+__svml_dexpm1_data_internal(%rip), %ymm8, %ymm2 + vandnpd %ymm3, %ymm5, %ymm1 + vmovmskpd %ymm7, %eax + vmovupd poly_coeff+64+__svml_dexpm1_data_internal(%rip), %ymm7 + vmulpd %ymm2, %ymm2, %ymm8 + vfmadd213pd poly_coeff+96+__svml_dexpm1_data_internal(%rip), %ymm2, %ymm7 + vandps ExpMask+__svml_dexpm1_data_internal(%rip), %ymm0, %ymm0 + vextractf128 $1, %ymm9, %xmm10 + vmovd %xmm9, %edx + vmovd %xmm10, %esi + vpextrd $2, %xmm9, %ecx + vpextrd $2, %xmm10, %edi + movslq %edx, %rdx + movslq %ecx, %rcx + movslq %esi, %rsi + movslq %edi, %rdi + vmovupd (%r8,%rdx), %xmm13 + vmovupd (%r8,%rcx), 
%xmm14 + vmovupd (%r8,%rsi), %xmm4 + vmovupd (%r8,%rdi), %xmm5 + vunpcklpd %xmm14, %xmm13, %xmm11 + vunpcklpd %xmm5, %xmm4, %xmm12 + vpsllq $41, %ymm0, %ymm10 + vunpckhpd %xmm14, %xmm13, %xmm15 + vunpckhpd %xmm5, %xmm4, %xmm13 + vinsertf128 $1, %xmm12, %ymm11, %ymm6 + +/* polynomial */ + vmovupd poly_coeff+__svml_dexpm1_data_internal(%rip), %ymm12 + +/* T-1 */ + vmovupd MOne+__svml_dexpm1_data_internal(%rip), %ymm11 + vfmadd213pd poly_coeff+32+__svml_dexpm1_data_internal(%rip), %ymm2, %ymm12 + vfmadd213pd %ymm7, %ymm8, %ymm12 + vorpd %ymm10, %ymm6, %ymm9 + vfmadd213pd %ymm2, %ymm8, %ymm12 + vaddpd %ymm11, %ymm9, %ymm2 + vinsertf128 $1, %xmm13, %ymm15, %ymm14 + +/* Th1 = (Th-1) + Tl */ + vfmadd213pd %ymm2, %ymm10, %ymm14 + +/* T = Th+Tl */ + vsubpd %ymm11, %ymm14, %ymm0 + vfmadd213pd %ymm14, %ymm12, %ymm0 + vorpd %ymm1, %ymm0, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm3, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* 
DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call expm1@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_expm1_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dexpm1_data_internal_typedef typedef unsigned 
int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Expm1_HA_table[(1<<8)][2]; + __declspec(align(32)) VUINT32 poly_coeff[4][4][2]; + __declspec(align(32)) VUINT32 Log2e[4][2]; + __declspec(align(32)) VUINT32 L2H[4][2]; + __declspec(align(32)) VUINT32 L2L[4][2]; + __declspec(align(32)) VUINT32 ExpAddConst[4][2]; + __declspec(align(32)) VUINT32 IndexMask[4][2]; + __declspec(align(32)) VUINT32 ExpMask[4][2]; + __declspec(align(32)) VUINT32 MOne[4][2]; + __declspec(align(32)) VUINT32 AbsMask[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 L2[4][2]; +} __svml_dexpm1_data_internal; +#endif +__svml_dexpm1_data_internal: + /* Expm1_HA_table */ + .quad 0x0000000000000000, 0x0000000000000000 + .quad 0x0000163da8000000, 0x3e3fb33356d84a67 + .quad 0x00002c9a40000000, 0xbe3887f9f1190835 + .quad 0x00004315e8000000, 0x3e1b9fe12f5ce3e7 + .quad 0x000059b0d0000000, 0x3e48ac2ba1d73e2a + .quad 0x0000706b28000000, 0x3e3ddf6ddc6dc404 + .quad 0x0000874518000000, 0x3e1d66f20230d7c9 + .quad 0x00009e3ec8000000, 0x3e46379c1a290f03 + .quad 0x0000b55870000000, 0xbe4833b784eb3a37 + .quad 0x0000cc9228000000, 0x3e4b923fba03db83 + .quad 0x0000e3ec30000000, 0x3e469e8d10103a17 + .quad 0x0000fb66b0000000, 0xbdb2ce50dcdf6e22 + .quad 0x00011301d0000000, 0x3df25b50a4ebbf1b + .quad 0x00012abdc0000000, 0x3e1b0c72fee4aeb5 + .quad 0x0001429ab0000000, 0xbe356d2204cbefe7 + .quad 0x00015a98c8000000, 0x3e24b1ca24901aae + .quad 0x000172b840000000, 0xbe4c15742919041c + .quad 0x00018af938000000, 0x3e2191bd3777ee17 + .quad 0x0001a35be8000000, 0x3e4b7e5ba9e5b4c8 + .quad 0x0001bbe088000000, 0xbe4fdd19632a70c7 + .quad 0x0001d48730000000, 0x3e368b9aa7805b80 + .quad 0x0001ed5020000000, 0x3e47e6c8e5c40d00 + .quad 0x0002063b88000000, 0x3e18a3358ee3bac1 + .quad 0x00021f4990000000, 0x3e37ddc962552fd3 + .quad 0x0002387a70000000, 0xbe38a9dc7993e052 + .quad 0x000251ce50000000, 0xbe135670329f5521 + .quad 0x00026b4568000000, 0xbe40ec1916d42cc6 + .quad 0x000284dfe0000000, 
0x3e3f5638096cf15d + .quad 0x00029e9df8000000, 0xbe470108f69ed175 + .quad 0x0002b87fd0000000, 0x3e2b5b31ffbbd48d + .quad 0x0002d285a8000000, 0xbe31bfcf4bff6e2b + .quad 0x0002ecafa8000000, 0x3e33e2f5611ca0f4 + .quad 0x000306fe08000000, 0x3e418db8a96f46ad + .quad 0x0003217100000000, 0xbe4d993e76563187 + .quad 0x00033c08b0000000, 0x3e4320b7fa64e431 + .quad 0x000356c560000000, 0xbe1b5803cdae772e + .quad 0x000371a738000000, 0xbe28aac6ab1d7560 + .quad 0x00038cae70000000, 0xbe47d13cd3d2b1a8 + .quad 0x0003a7db38000000, 0xbe48d30048af21b7 + .quad 0x0003c32dc0000000, 0x3e489d47242000f9 + .quad 0x0003dea650000000, 0xbe4f6e5eee525f6f + .quad 0x0003fa4508000000, 0xbe4a9bff22fa047f + .quad 0x0004160a20000000, 0x3e3f72e29f84325c + .quad 0x000431f5d8000000, 0x3e350a896dc70444 + .quad 0x00044e0860000000, 0x3e18624b40c4dbd0 + .quad 0x00046a41f0000000, 0xbe4717fd446d7686 + .quad 0x000486a2b8000000, 0xbe41f6197f61f2e2 + .quad 0x0004a32af0000000, 0x3e2afa7bcce5b17a + .quad 0x0004bfdad8000000, 0xbe464eaec715e343 + .quad 0x0004dcb298000000, 0x3e3fddd0d63b36ef + .quad 0x0004f9b278000000, 0xbe362d35952cc275 + .quad 0x000516daa0000000, 0x3e467b320e0897a9 + .quad 0x0005342b58000000, 0xbe362b07e20f57c4 + .quad 0x000551a4c8000000, 0x3e42ec9076297631 + .quad 0x00056f4738000000, 0xbe34ad8259913500 + .quad 0x00058d12d8000000, 0xbe4b41c016d6a1ea + .quad 0x0005ab07e0000000, 0xbe45bd5eb539b67f + .quad 0x0005c92688000000, 0x3e42ca35b80e258e + .quad 0x0005e76f18000000, 0xbe4296f5bc8b20da + .quad 0x000605e1b8000000, 0x3e376dc08b076f59 + .quad 0x0006247eb0000000, 0x3e0d2ac258f87d03 + .quad 0x0006434638000000, 0xbe4999e701c483c7 + .quad 0x0006623880000000, 0x3e42a91124893ecf + .quad 0x00068155d8000000, 0xbe4d9ab467bf1d47 + .quad 0x0006a09e68000000, 0xbe380c4336f74d05 + .quad 0x0006c01278000000, 0xbe47a12a08944ab3 + .quad 0x0006dfb240000000, 0xbe4cd72e886ef8ea + .quad 0x0006ff7df8000000, 0x3e3519483cf87e1b + .quad 0x00071f75e8000000, 0x3e2d8bee7ba46e1e + .quad 0x00073f9a48000000, 0x3e24b02e77ab934a + 
.quad 0x00075feb58000000, 0xbe3bd98374091656 + .quad 0x0007806950000000, 0xbe00d1604f328fec + .quad 0x0007a11470000000, 0x3e4f580c36bea881 + .quad 0x0007c1ed00000000, 0x3e330c1327c49334 + .quad 0x0007e2f338000000, 0xbe330b19defa2fd4 + .quad 0x0008042758000000, 0xbe4e0f2f724f90cc + .quad 0x0008258998000000, 0x3e34cce128acf88b + .quad 0x0008471a48000000, 0xbe3dc385331ad094 + .quad 0x000868d998000000, 0x3e4a2497640720ed + .quad 0x00088ac7d8000000, 0x3e38a669966530bd + .quad 0x0008ace540000000, 0x3e415506dadd3e2b + .quad 0x0008cf3218000000, 0xbe34abb7410d55e3 + .quad 0x0008f1ae98000000, 0x3e31577362b98274 + .quad 0x0009145b08000000, 0x3e4c8ffe2c4530da + .quad 0x00093737b0000000, 0x3e29b8bc9e8a0388 + .quad 0x00095a44c8000000, 0x3e4e4290774da41b + .quad 0x00097d82a0000000, 0xbe00d8d83a30b6f8 + .quad 0x0009a0f170000000, 0x3e2940f737462137 + .quad 0x0009c49180000000, 0x3e451f8480e3e236 + .quad 0x0009e86318000000, 0x3e3e323231824ca8 + .quad 0x000a0c6678000000, 0x3e4aef2b2594d6d4 + .quad 0x000a309bf0000000, 0xbe4dae966539f470 + .quad 0x000a5503b0000000, 0x3e41f12ae45a1225 + .quad 0x000a799e10000000, 0x3e49859ac3796fd9 + .quad 0x000a9e6b58000000, 0xbe44301205e0a6de + .quad 0x000ac36bc0000000, 0xbe0606431f9234cb + .quad 0x000ae89f98000000, 0x3e35ad3ad5e8734d + .quad 0x000b0e0728000000, 0x3e38db66590842ad + .quad 0x000b33a2b8000000, 0x3e13c57ebdaff43a + .quad 0x000b597290000000, 0xbe40d536338e3bf7 + .quad 0x000b7f76f0000000, 0x3e47daf237553d84 + .quad 0x000ba5b030000000, 0x3e2420c930819679 + .quad 0x000bcc1e90000000, 0x3e12f074891ee83d + .quad 0x000bf2c258000000, 0x3e4eb8f0442046b8 + .quad 0x000c199be0000000, 0xbe43d56b1eeef9a7 + .quad 0x000c40ab60000000, 0xbd87c2c975903ef8 + .quad 0x000c67f130000000, 0xbe3a82eb4b5dec80 + .quad 0x000c8f6d98000000, 0xbe4fc8c257729a1e + .quad 0x000cb720e0000000, 0xbe48837cb757e1a1 + .quad 0x000cdf0b58000000, 0xbe4511e031dd83b5 + .quad 0x000d072d48000000, 0x3e403c4bdc687918 + .quad 0x000d2f8708000000, 0x3deb13e315bc2473 + .quad 0x000d5818e0000000, 
0xbe4822dbc6d12fd3 + .quad 0x000d80e318000000, 0xbe3367c68447b063 + .quad 0x000da9e600000000, 0x3e4ed9942b84600d + .quad 0x000dd321f0000000, 0x3e480da3025b4aef + .quad 0x000dfc9730000000, 0x3e4bdcdaf5cb4656 + .quad 0x000e264618000000, 0xbe4852f6baf6c4f0 + .quad 0x000e502ee8000000, 0xbe1d30027630bb40 + .quad 0x000e7a51f8000000, 0x3e4e3a641a5aa459 + .quad 0x000ea4afa0000000, 0x3e452486cc2c7b9d + .quad 0x000ecf4830000000, 0xbe438cc07b927e77 + .quad 0x000efa1bf0000000, 0xbe39ea5d888e02de + .quad 0x000f252b38000000, 0xbe2288ad162f2d20 + .quad 0x000f507658000000, 0x3e4b722a033a7c26 + .quad 0x000f7bfdb0000000, 0xbe431a0f63b7625a + .quad 0x000fa7c180000000, 0x3e39e90d82e90a7e + .quad 0x000fd3c228000000, 0x3e4c7b8f884badd2 + /*== poly_coeff[4] ==*/ + .align 32 + .quad 0x3f81111168877F38, 0x3f81111168877F38, 0x3f81111168877F38, 0x3f81111168877F38 /* coeff5 */ + .quad 0x3fa55555C2A9C0F3, 0x3fa55555C2A9C0F3, 0x3fa55555C2A9C0F3, 0x3fa55555C2A9C0F3 /* coeff4 */ + .quad 0x3fc555555555541D, 0x3fc555555555541D, 0x3fc555555555541D, 0x3fc555555555541D /* coeff3 */ + .quad 0x3fdFFFFFFFFFFE5C, 0x3fdFFFFFFFFFFE5C, 0x3fdFFFFFFFFFFE5C, 0x3fdFFFFFFFFFFE5C /* coeff2 */ + /*== Log2e ==*/ + .align 32 + .quad 0x40671547652B82FE, 0x40671547652B82FE, 0x40671547652B82FE, 0x40671547652B82FE + /*== L2H ==*/ + .align 32 + .quad 0x3f762e42fef80000, 0x3f762e42fef80000, 0x3f762e42fef80000, 0x3f762e42fef80000 + /*== L2L ==*/ + .align 32 + .quad 0x3d41cf79abc9e3b4, 0x3d41cf79abc9e3b4, 0x3d41cf79abc9e3b4, 0x3d41cf79abc9e3b4 + /*== ExpAddConst ==*/ + .align 32 + .quad 0x42f80000001ff800, 0x42f80000001ff800, 0x42f80000001ff800, 0x42f80000001ff800 + /*== IndexMask ==*/ + .align 32 + .quad 0x00000000000007f0, 0x00000000000007f0, 0x00000000000007f0, 0x00000000000007f0 + /*== ExpMask ==*/ + .align 32 + .quad 0x00000000003ff800, 0x00000000003ff800, 0x00000000003ff800, 0x00000000003ff800 + /*== MOne ==*/ + .align 32 + .quad 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000, 0xbff0000000000000 + /*== 
AbsMask ==*/ + .align 32 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== Threshold ==*/ + .align 32 + .quad 0x40861DA04CBAFE43, 0x40861DA04CBAFE43, 0x40861DA04CBAFE43, 0x40861DA04CBAFE43 + /*== L2 ==*/ + .align 32 + .quad 0x3f762e42fefa39ef, 0x3f762e42fefa39ef, 0x3f762e42fefa39ef, 0x3f762e42fefa39ef + .align 32 + .type __svml_dexpm1_data_internal,@object + .size __svml_dexpm1_data_internal,.-__svml_dexpm1_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core-avx2.S new file mode 100644 index 0000000000..3b75d1de16 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized expm1, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN8v_expm1 _ZGVeN8v_expm1_avx2_wrapper +#include "../svml_d_expm18_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core.c new file mode 100644 index 0000000000..860edf6df5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized expm1, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN8v_expm1 +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_expm1, __GI__ZGVeN8v_expm1, __redirect__ZGVeN8v_expm1) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core_avx512.S new file mode 100644 index 0000000000..64cee91abd --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_expm18_core_avx512.S @@ -0,0 +1,334 @@ +/* Function expm1 vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * After computing exp(x) in high-low parts, an accurate computation is performed to obtain exp(x)-1 + * Typical exp() implementation, except that: + * - tables are small (16 elements), allowing for fast gathers + * - all arguments processed in the main path + * - final VSCALEF assists branch-free design (correct overflow/underflow and special case responses) + * - a VAND is used to ensure the reduced argument |R|<2, even for large inputs + * - RZ mode used to avoid overflow to +/-Inf for x*log2(e); helps with special case handling + * + * + */ + +/* Offsets for data table __svml_dexpm1_data_internal_avx512 + */ +#define Exp_tbl_H 0 +#define Exp_tbl_L 128 +#define L2E 256 +#define Shifter 320 +#define Threshold 384 +#define SgnMask 448 +#define L2H 512 +#define L2L 576 +#define ZThres 640 +#define EMask 704 +#define poly_coeff7 768 +#define poly_coeff6 832 +#define poly_coeff5 896 +#define poly_coeff4 960 +#define poly_coeff3 1024 +#define poly_coeff2 1088 +#define One 1152 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_expm1_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups L2E+__svml_dexpm1_data_internal_avx512(%rip), %zmm6 + vmovups Shifter+__svml_dexpm1_data_internal_avx512(%rip), %zmm4 + vmovups L2H+__svml_dexpm1_data_internal_avx512(%rip), %zmm11 + vmovups L2L+__svml_dexpm1_data_internal_avx512(%rip), %zmm5 + vmovups Threshold+__svml_dexpm1_data_internal_avx512(%rip), %zmm3 + vmovups poly_coeff5+__svml_dexpm1_data_internal_avx512(%rip), %zmm13 + vmovups poly_coeff4+__svml_dexpm1_data_internal_avx512(%rip), %zmm15 + +/* polynomial */ + vmovups poly_coeff7+__svml_dexpm1_data_internal_avx512(%rip), %zmm12 + +/* set Z0=max(Z0, -128.0) */ + vmovups 
ZThres+__svml_dexpm1_data_internal_avx512(%rip), %zmm8 + vmovups poly_coeff3+__svml_dexpm1_data_internal_avx512(%rip), %zmm14 + vmovups __svml_dexpm1_data_internal_avx512(%rip), %zmm9 + vmovaps %zmm0, %zmm2 + +/* 2^(52-4)*1.5 + x * log2(e) */ + vfmadd213pd {rn-sae}, %zmm4, %zmm2, %zmm6 + vmovups Exp_tbl_L+__svml_dexpm1_data_internal_avx512(%rip), %zmm0 + vcmppd $21, {sae}, %zmm3, %zmm2, %k0 + +/* Z0 ~ x*log2(e), rounded to 4 fractional bits */ + vsubpd {rn-sae}, %zmm4, %zmm6, %zmm7 + vpermt2pd Exp_tbl_H+64+__svml_dexpm1_data_internal_avx512(%rip), %zmm6, %zmm9 + vpermt2pd Exp_tbl_L+64+__svml_dexpm1_data_internal_avx512(%rip), %zmm6, %zmm0 + vandpd SgnMask+__svml_dexpm1_data_internal_avx512(%rip), %zmm2, %zmm1 + +/* R = x - Z0*log(2) */ + vfnmadd213pd {rn-sae}, %zmm2, %zmm7, %zmm11 + vmaxpd {sae}, %zmm8, %zmm7, %zmm10 + vfnmadd231pd {rn-sae}, %zmm7, %zmm5, %zmm11 + kmovw %k0, %edx + +/* ensure |R|<2 even for special cases */ + vandpd EMask+__svml_dexpm1_data_internal_avx512(%rip), %zmm11, %zmm3 + vmovups poly_coeff6+__svml_dexpm1_data_internal_avx512(%rip), %zmm11 + +/* scale Th */ + vscalefpd {rn-sae}, %zmm10, %zmm9, %zmm4 + vfmadd231pd {rn-sae}, %zmm3, %zmm13, %zmm15 + vfmadd231pd {rn-sae}, %zmm3, %zmm12, %zmm11 + vmovups poly_coeff2+__svml_dexpm1_data_internal_avx512(%rip), %zmm12 + vmulpd {rn-sae}, %zmm3, %zmm3, %zmm13 + vfmadd231pd {rn-sae}, %zmm3, %zmm14, %zmm12 + vfmadd213pd {rn-sae}, %zmm15, %zmm13, %zmm11 + vfmadd213pd {rn-sae}, %zmm12, %zmm13, %zmm11 + +/* Tlr + R+ R*Poly */ + vfmadd213pd {rn-sae}, %zmm0, %zmm13, %zmm11 + +/* Th - 1 */ + vmovups One+__svml_dexpm1_data_internal_avx512(%rip), %zmm0 + vaddpd {rn-sae}, %zmm3, %zmm11, %zmm14 + vsubpd {rn-sae}, %zmm0, %zmm4, %zmm15 + +/* (Th-1)+Th*(Tlr + R+ R*Poly) */ + vfmadd213pd {rn-sae}, %zmm15, %zmm14, %zmm4 + vorpd %zmm1, %zmm4, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm2 + +/* Restore registers + * and 
exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm2, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 
0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call expm1@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_expm1_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dexpm1_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Exp_tbl_H[16][2]; + __declspec(align(64)) VUINT32 Exp_tbl_L[16][2]; + __declspec(align(64)) VUINT32 L2E[8][2]; + __declspec(align(64)) VUINT32 Shifter[8][2]; + __declspec(align(64)) VUINT32 Threshold[8][2]; + __declspec(align(64)) VUINT32 SgnMask[8][2]; + __declspec(align(64)) VUINT32 L2H[8][2]; + __declspec(align(64)) VUINT32 L2L[8][2]; + __declspec(align(64)) VUINT32 ZThres[8][2]; + __declspec(align(64)) VUINT32 EMask[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 One[8][2]; + } __svml_dexpm1_data_internal_avx512; +#endif +__svml_dexpm1_data_internal_avx512: + /*== Exp_tbl_H ==*/ + .quad 0x3ff0000000000000 + .quad 0x3ff0b5586cf9890f + 
.quad 0x3ff172b83c7d517b + .quad 0x3ff2387a6e756238 + .quad 0x3ff306fe0a31b715 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff5ab07dd485429 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff7a11473eb0187 + .quad 0x3ff8ace5422aa0db + .quad 0x3ff9c49182a3f090 + .quad 0x3ffae89f995ad3ad + .quad 0x3ffc199bdd85529c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffea4afa2a490da + /*== Exp_tbl_L ==*/ + .align 64 + .quad 0x0000000000000000 + .quad 0x3c979aa65d837b6d + .quad 0xbc801b15eaa59348 + .quad 0x3c968efde3a8a894 + .quad 0x3c834d754db0abb6 + .quad 0x3c859f48a72a4c6d + .quad 0x3c7690cebb7aafb0 + .quad 0x3c9063e1e21c5409 + .quad 0xbc93b3efbf5e2228 + .quad 0xbc7b32dcb94da51d + .quad 0x3c8db72fc1f0eab4 + .quad 0x3c71affc2b91ce27 + .quad 0x3c8c1a7792cb3387 + .quad 0x3c736eae30af0cb3 + .quad 0x3c74a385a63d07a7 + .quad 0xbc8ff7128fd391f0 + /*== log2(e) ==*/ + .align 64 + .quad 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE + /*== Shifter=2^(52-4)*1.5 ==*/ + .align 64 + .quad 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0, 0x42f8000000003ff0 + /*== Threshold ==*/ + .align 64 + .quad 0x40861DA04CBAFE44, 0x40861DA04CBAFE44, 0x40861DA04CBAFE44, 0x40861DA04CBAFE44, 0x40861DA04CBAFE44, 0x40861DA04CBAFE44, 0x40861DA04CBAFE44, 0x40861DA04CBAFE44 + /*== Sgn ==*/ + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .quad 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef, 0x3fe62e42fefa39ef + /*== L2L = log(2)_low ==*/ + .align 64 + .quad 0x3c7abc9e3b39803f, 0x3c7abc9e3b39803f, 0x3c7abc9e3b39803f, 0x3c7abc9e3b39803f, 
0x3c7abc9e3b39803f, 0x3c7abc9e3b39803f, 0x3c7abc9e3b39803f, 0x3c7abc9e3b39803f + /*== ZThres ==*/ + .align 64 + .quad 0xc060000000000000, 0xc060000000000000, 0xc060000000000000, 0xc060000000000000, 0xc060000000000000, 0xc060000000000000, 0xc060000000000000, 0xc060000000000000 + /*== EMask ==*/ + .align 64 + .quad 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff, 0xbfffffffffffffff + /*== poly_coeff7 ==*/ + .align 64 + .quad 0x3f2a020410303d8a, 0x3f2a020410303d8a, 0x3f2a020410303d8a, 0x3f2a020410303d8a, 0x3f2a020410303d8a, 0x3f2a020410303d8a, 0x3f2a020410303d8a, 0x3f2a020410303d8a + /*== poly_coeff6 ==*/ + .align 64 + .quad 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f, 0x3f56c1c38e164a2f + /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3f81111110865214, 0x3f81111110865214, 0x3f81111110865214, 0x3f81111110865214, 0x3f81111110865214, 0x3f81111110865214, 0x3f81111110865214, 0x3f81111110865214 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0x3fa5555554ad3d06, 0x3fa5555554ad3d06, 0x3fa5555554ad3d06, 0x3fa5555554ad3d06, 0x3fa5555554ad3d06, 0x3fa5555554ad3d06, 0x3fa5555554ad3d06, 0x3fa5555554ad3d06 + /*== poly_coeff3 ==*/ + .align 64 + .quad 0x3fc5555555555656, 0x3fc5555555555656, 0x3fc5555555555656, 0x3fc5555555555656, 0x3fc5555555555656, 0x3fc5555555555656, 0x3fc5555555555656, 0x3fc5555555555656 + /*== poly_coeff2 ==*/ + .align 64 + .quad 0x3fe00000000000a2, 0x3fe00000000000a2, 0x3fe00000000000a2, 0x3fe00000000000a2, 0x3fe00000000000a2, 0x3fe00000000000a2, 0x3fe00000000000a2, 0x3fe00000000000a2 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + .align 64 + .type __svml_dexpm1_data_internal_avx512,@object + .size 
__svml_dexpm1_data_internal_avx512,.-__svml_dexpm1_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core-avx2.S new file mode 100644 index 0000000000..a2a8699a05 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized expm1f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN16v_expm1f _ZGVeN16v_expm1f_avx2_wrapper +#include "../svml_s_expm1f16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core.c new file mode 100644 index 0000000000..8007d1e415 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized expm1f, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_expm1f +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_expm1f, __GI__ZGVeN16v_expm1f, + __redirect__ZGVeN16v_expm1f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core_avx512.S new file mode 100644 index 0000000000..5b0dcde77f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f16_core_avx512.S @@ -0,0 +1,281 @@ +/* Function expm1f vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * After computing exp(x) in high-low parts, an accurate computation is performed to obtain exp(x)-1 + * Typical exp() implementation, except that: + * - tables are small (32 elements), allowing for fast gathers + * - all arguments processed in the main path + * - final VSCALEF assists branch-free design (correct overflow/underflow and special case responses) + * - a VAND is used to ensure the reduced argument |R|<2, even for large inputs + * - RZ mode used to avoid overflow to +/-Inf for x*log2(e); helps with special case handling + * + * + */ + +/* Offsets for data table __svml_sexpm1_data_internal_avx512 + */ +#define Exp_tbl_H 0 +#define Exp_tbl_L 128 +#define L2E 256 +#define Shifter 320 +#define Threshold 384 +#define SgnMask 448 +#define L2H 512 +#define L2L 576 +#define EMask 640 +#define poly_coeff3 704 +#define poly_coeff2 768 +#define One 832 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_expm1f_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups L2E+__svml_sexpm1_data_internal_avx512(%rip), %zmm5 + vmovups Shifter+__svml_sexpm1_data_internal_avx512(%rip), %zmm3 + vmovups L2H+__svml_sexpm1_data_internal_avx512(%rip), %zmm8 + vmovups L2L+__svml_sexpm1_data_internal_avx512(%rip), %zmm4 + vmovups __svml_sexpm1_data_internal_avx512(%rip), %zmm6 + +/* polynomial */ + vmovups poly_coeff3+__svml_sexpm1_data_internal_avx512(%rip), %zmm9 + vmovups poly_coeff2+__svml_sexpm1_data_internal_avx512(%rip), %zmm12 + vmovups Exp_tbl_L+__svml_sexpm1_data_internal_avx512(%rip), %zmm11 + vmovups Threshold+__svml_sexpm1_data_internal_avx512(%rip), %zmm2 + +/* Th - 1 */ + vmovups One+__svml_sexpm1_data_internal_avx512(%rip), %zmm14 + vmovaps %zmm0, %zmm1 + +/* 2^(23-5)*1.5 + x * log2(e) */ + vfmadd213ps {rn-sae}, %zmm3, %zmm1, %zmm5 + vcmpps $29, {sae}, %zmm2, %zmm1, %k0 + +/* Z0 ~ x*log2(e), rounded to 5
fractional bits */ + vsubps {rn-sae}, %zmm3, %zmm5, %zmm7 + vpermt2ps Exp_tbl_H+64+__svml_sexpm1_data_internal_avx512(%rip), %zmm5, %zmm6 + vpermt2ps Exp_tbl_L+64+__svml_sexpm1_data_internal_avx512(%rip), %zmm5, %zmm11 + vandps SgnMask+__svml_sexpm1_data_internal_avx512(%rip), %zmm1, %zmm0 + +/* R = x - Z0*log(2) */ + vfnmadd213ps {rn-sae}, %zmm1, %zmm7, %zmm8 + +/* scale Th */ + vscalefps {rn-sae}, %zmm7, %zmm6, %zmm2 + vfnmadd231ps {rn-sae}, %zmm7, %zmm4, %zmm8 + kmovw %k0, %edx + +/* ensure |R|<2 even for special cases */ + vandps EMask+__svml_sexpm1_data_internal_avx512(%rip), %zmm8, %zmm13 + vsubps {rn-sae}, %zmm14, %zmm2, %zmm8 + vmulps {rn-sae}, %zmm13, %zmm13, %zmm10 + vfmadd231ps {rn-sae}, %zmm13, %zmm9, %zmm12 + +/* Tlr + R+ R2*Poly */ + vfmadd213ps {rn-sae}, %zmm11, %zmm10, %zmm12 + vaddps {rn-sae}, %zmm13, %zmm12, %zmm15 + +/* (Th-1)+Th*(Tlr + R+ R*Poly) */ + vfmadd213ps {rn-sae}, %zmm8, %zmm15, %zmm2 + vorps %zmm0, %zmm2, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm1, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 
0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call expm1f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d
+END(_ZGVeN16v_expm1f_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_sexpm1_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Exp_tbl_H[32][1]; + __declspec(align(64)) VUINT32 Exp_tbl_L[32][1]; + __declspec(align(64)) VUINT32 L2E[16][1]; + __declspec(align(64)) VUINT32 Shifter[16][1]; + __declspec(align(64)) VUINT32 Threshold[16][1]; + __declspec(align(64)) VUINT32 SgnMask[16][1]; + __declspec(align(64)) VUINT32 L2H[16][1]; + __declspec(align(64)) VUINT32 L2L[16][1]; + __declspec(align(64)) VUINT32 EMask[16][1]; + __declspec(align(64)) VUINT32 poly_coeff3[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 One[16][1]; + } __svml_sexpm1_data_internal_avx512; +#endif +__svml_sexpm1_data_internal_avx512: + /*== Exp_tbl_H ==*/ + .long 0x3f800000, 0x3f82cd87, 0x3f85aac3, 0x3f88980f + .long 0x3f8b95c2, 0x3f8ea43a, 0x3f91c3d3, 0x3f94f4f0 + .long 0x3f9837f0, 0x3f9b8d3a, 0x3f9ef532, 0x3fa27043 + .long 0x3fa5fed7, 0x3fa9a15b, 0x3fad583f, 0x3fb123f6 + .long 0x3fb504f3, 0x3fb8fbaf, 0x3fbd08a4, 0x3fc12c4d + .long 0x3fc5672a, 0x3fc9b9be, 0x3fce248c, 0x3fd2a81e + .long 0x3fd744fd, 0x3fdbfbb8, 0x3fe0ccdf, 0x3fe5b907 + .long 0x3feac0c7, 0x3fefe4ba, 0x3ff5257d, 0x3ffa83b3 + /*== Exp_tbl_L ==*/ + .align 64 + .long 0x00000000, 0xb34a3a0a, 0x3346cb6a, 0xb36ed17e + .long 0xb24e0611, 0xb3517dd9, 0x334b2482, 0xb31586de + .long 0x33092801, 0xb2e6f467, 0x331b85f2, 0x3099b6f1 + .long 0xb3051aa8, 0xb2e2a0da, 0xb2006c56, 0xb3365942 + .long 0x329302ae, 0x32c595dc, 0xb302e5a2, 0xb28e10a1 + .long 0x31b3d0e5, 0xb31a472b, 0x31d1daf2, 0xb305bf64 + .long 0xb27ce182, 0xb2f26443, 0xb1b4b0da, 0xb1da8a8f + .long 0xb1d290be, 0xb2d5b899, 0x31b0a147, 0xb2156afc + /*== log2(e) ==*/ + .align 64 + .long 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B + /*== 
Shifter=2^(23-5)*1.5 ==*/ + .align 64 + .long 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000, 0x48c00000 + /*== Threshold ==*/ + .align 64 + .long 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B + /*== Sgn ==*/ + .align 64 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + /*== L2L = log(2)_low ==*/ + .align 64 + .long 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308, 0xb102e308 + /*== EMask ==*/ + .align 64 + .long 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff, 0xbfffffff + /*== poly_coeff3 ==*/ + .align 64 + .long 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3, 0x3e2AABF3 + /*== poly_coeff2 ==*/ + .align 64 + .long 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6, 0x3f0000F6 + /*== One ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 
0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + .align 64 + .type __svml_sexpm1_data_internal_avx512,@object + .size __svml_sexpm1_data_internal_avx512,.-__svml_sexpm1_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core-sse2.S new file mode 100644 index 0000000000..b4dbb77590 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized expm1f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_expm1f _ZGVbN4v_expm1f_sse2 +#include "../svml_s_expm1f4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core.c new file mode 100644 index 0000000000..f8ef12511d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized expm1f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_expm1f +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_expm1f, __GI__ZGVbN4v_expm1f, + __redirect__ZGVbN4v_expm1f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core_sse4.S new file mode 100644 index 0000000000..18770f6dbb --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f4_core_sse4.S @@ -0,0 +1,358 @@ +/* Function expm1f vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * N = (int)(x*2^k/log(2.0)), R = x - N*log(2)/2^k + * exp(x) = 2^(N/2^k) * poly(R) is computed in high-low parts + * expm1(x) = exp(x)-1 is then obtained via multi-precision computation + * + * + */ + +/* Offsets for data table __svml_sexpm1_data_internal + */ +#define Expm1_HA_table 0 +#define poly_coeff 512 +#define Log2e 576 +#define L2H 592 +#define L2L 608 +#define ExpAddConst 624 +#define IndexMask 640 +#define ExpMask 656 +#define MOne 672 +#define AbsMask 688 +#define Threshold 704 +#define L2 720 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_expm1f_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + movaps %xmm0, %xmm4 + movups Log2e+__svml_sexpm1_data_internal(%rip), %xmm9 + lea __svml_sexpm1_data_internal(%rip), %r8 + mulps %xmm0, %xmm9 + movups .FLT_10(%rip), %xmm5 + movups ExpAddConst+__svml_sexpm1_data_internal(%rip), %xmm2 + addps %xmm5, %xmm9 + +/* argument reduction */ + movups L2H+__svml_sexpm1_data_internal(%rip), %xmm6 + subps %xmm5, %xmm9 + mulps %xmm9, %xmm6 + addps %xmm9, %xmm2 + +/* table lookup */ + movdqu IndexMask+__svml_sexpm1_data_internal(%rip), %xmm12 + subps %xmm6, %xmm4 + pand %xmm2, %xmm12 + movups L2L+__svml_sexpm1_data_internal(%rip), %xmm7 + movups AbsMask+__svml_sexpm1_data_internal(%rip), %xmm3 + pshufd $1, %xmm12, %xmm10 + movaps %xmm3, %xmm8 + mulps %xmm9, %xmm7 + andps %xmm0, %xmm8 + cmpnleps Threshold+__svml_sexpm1_data_internal(%rip), %xmm8 + movd %xmm12, %edx + subps %xmm7, %xmm4 + movd %xmm10, %ecx + movmskps %xmm8, %eax + pshufd $2, %xmm12, %xmm11 + movaps %xmm4, %xmm7 + pshufd $3, %xmm12, %xmm13 + andnps %xmm0, %xmm3 + movd %xmm11, %esi + movd %xmm13, %edi + +/* polynomial */ + movups 
poly_coeff+__svml_sexpm1_data_internal(%rip), %xmm8 + movdqu ExpMask+__svml_sexpm1_data_internal(%rip), %xmm6 + movslq %edx, %rdx + pand %xmm6, %xmm2 + movslq %ecx, %rcx + pslld $14, %xmm2 + movslq %esi, %rsi + movslq %edi, %rdi + movq (%r8,%rdx), %xmm1 + movq (%r8,%rcx), %xmm14 + movq (%r8,%rsi), %xmm5 + movq (%r8,%rdi), %xmm15 + unpcklps %xmm14, %xmm1 + mulps %xmm4, %xmm8 + movaps %xmm1, %xmm10 + mulps %xmm4, %xmm7 + addps poly_coeff+16+__svml_sexpm1_data_internal(%rip), %xmm8 + unpcklps %xmm15, %xmm5 + movlhps %xmm5, %xmm10 + shufps $238, %xmm5, %xmm1 + orps %xmm2, %xmm10 + +/* T-1 */ + movups MOne+__svml_sexpm1_data_internal(%rip), %xmm9 + mulps %xmm2, %xmm1 + addps %xmm9, %xmm10 + mulps %xmm7, %xmm8 + addps %xmm1, %xmm10 + addps %xmm8, %xmm4 + movaps %xmm10, %xmm1 + subps %xmm9, %xmm1 + mulps %xmm1, %xmm4 + addps %xmm4, %xmm10 + orps %xmm3, %xmm10 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax xmm0 xmm10 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm10, %xmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm10, 48(%rsp) + # LOE rbx r12 r13 r14 r15 eax + + xorl %edx, %edx + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: 
r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm10 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm10 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call expm1f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN4v_expm1f_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_sexpm1_data_internal_typedef +typedef unsigned int VUINT32;
+typedef struct { + __declspec(align(16)) VUINT32 Expm1_HA_table[(1<<7)][1]; + __declspec(align(16)) VUINT32 poly_coeff[4][4][1]; + __declspec(align(16)) VUINT32 Log2e[4][1]; + __declspec(align(16)) VUINT32 L2H[4][1]; + __declspec(align(16)) VUINT32 L2L[4][1]; + __declspec(align(16)) VUINT32 ExpAddConst[4][1]; + __declspec(align(16)) VUINT32 IndexMask[4][1]; + __declspec(align(16)) VUINT32 ExpMask[4][1]; + __declspec(align(16)) VUINT32 MOne[4][1]; + __declspec(align(16)) VUINT32 AbsMask[4][1]; + __declspec(align(16)) VUINT32 Threshold[4][1]; + __declspec(align(16)) VUINT32 L2[4][1]; +} __svml_sexpm1_data_internal; +#endif +__svml_sexpm1_data_internal: + /* Expm1_HA_table */ + .long 0x00000000, 0x00000000 + .long 0x00016000, 0x391a3e78 + .long 0x0002d000, 0xb89e59d5 + .long 0x00044000, 0xb93ae78a + .long 0x0005b000, 0xb9279306 + .long 0x00072000, 0xb79e6961 + .long 0x0008a000, 0xb97e2fee + .long 0x000a1000, 0x391aaea9 + .long 0x000b9000, 0x39383c7d + .long 0x000d2000, 0xb9241490 + .long 0x000ea000, 0x39073169 + .long 0x00103000, 0x386e218a + .long 0x0011c000, 0x38f4dceb + .long 0x00136000, 0xb93a9a1e + .long 0x0014f000, 0x391df520 + .long 0x00169000, 0x3905a6e4 + .long 0x00183000, 0x397e0a32 + .long 0x0019e000, 0x370b2641 + .long 0x001b9000, 0xb8b1918b + .long 0x001d4000, 0xb8132c6a + .long 0x001ef000, 0x39264c12 + .long 0x0020b000, 0x37221f73 + .long 0x00227000, 0x37060619 + .long 0x00243000, 0x3922b5c1 + .long 0x00260000, 0xb814ab27 + .long 0x0027d000, 0xb89b12c6 + .long 0x0029a000, 0x382d5a75 + .long 0x002b8000, 0xb938c94b + .long 0x002d6000, 0xb97822b8 + .long 0x002f4000, 0xb910ea53 + .long 0x00312000, 0x38fd6075 + .long 0x00331000, 0x38620955 + .long 0x00350000, 0x391e667f + .long 0x00370000, 0xb89b8736 + .long 0x00390000, 0xb90a1714 + .long 0x003b0000, 0xb7a54ded + .long 0x003d1000, 0xb96b8c15 + .long 0x003f1000, 0x397336cf + .long 0x00413000, 0xb8eccd66 + .long 0x00434000, 0x39599b45 + .long 0x00456000, 0x3965422b + .long 0x00479000, 0xb8a2cdd5 + .long 
0x0049c000, 0xb9484f32 + .long 0x004bf000, 0xb8fac043 + .long 0x004e2000, 0x391182a4 + .long 0x00506000, 0x38ccf6bc + .long 0x0052b000, 0xb97c4dc2 + .long 0x0054f000, 0x38d6aaf4 + .long 0x00574000, 0x391f995b + .long 0x0059a000, 0xb8ba8f62 + .long 0x005c0000, 0xb9090d05 + .long 0x005e6000, 0x37f4825e + .long 0x0060d000, 0xb8c844f5 + .long 0x00634000, 0xb76d1a83 + .long 0x0065c000, 0xb95f2310 + .long 0x00684000, 0xb952b5f8 + .long 0x006ac000, 0x37c6e7dd + .long 0x006d5000, 0xb7cfe126 + .long 0x006fe000, 0x3917337c + .long 0x00728000, 0x383b9e2d + .long 0x00752000, 0x392fa2a5 + .long 0x0077d000, 0x37df730b + .long 0x007a8000, 0x38ecb6dd + .long 0x007d4000, 0xb879f986 + /*== poly_coeff[4] ==*/ + .align 16 + .long 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF /* coeff3 */ + .long 0x3f00000F, 0x3f00000F, 0x3f00000F, 0x3f00000F /* coeff2 */ + /* 32 Byte Padding */ + .zero 32 + /*== Log2e ==*/ + .align 16 + .long 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B + /*== L2H ==*/ + .align 16 + .long 0x3c318000, 0x3c318000, 0x3c318000, 0x3c318000 + /*== L2L ==*/ + .align 16 + .long 0xb65e8083, 0xb65e8083, 0xb65e8083, 0xb65e8083 + /*== ExpAddConst ==*/ + .align 16 + .long 0x49f0fe00, 0x49f0fe00, 0x49f0fe00, 0x49f0fe00 + /*== IndexMask ==*/ + .align 16 + .long 0x000001f8, 0x000001f8, 0x000001f8, 0x000001f8 + /*== ExpMask ==*/ + .align 16 + .long 0x0001fe00, 0x0001fe00, 0x0001fe00, 0x0001fe00 + /*== MOne ==*/ + .align 16 + .long 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000 + /*== AbsMask ==*/ + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== Threshold ==*/ + .align 16 + .long 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B // 86.643394 + /*== L2 ==*/ + .align 16 + .long 0x3cb17218, 0x3cb17218, 0x3cb17218, 0x3cb17218 + .align 16 + .type __svml_sexpm1_data_internal,@object + .size __svml_sexpm1_data_internal,.-__svml_sexpm1_data_internal + .align 16 + +.FLT_10: + .long 0x4b400000,0x4b400000,0x4b400000,0x4b400000 + .type .FLT_10,@object + .size .FLT_10,16 
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core-sse.S new file mode 100644 index 0000000000..e34e4eb8d0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized expm1f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_expm1f _ZGVdN8v_expm1f_sse_wrapper +#include "../svml_s_expm1f8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core.c new file mode 100644 index 0000000000..7e8b57de30 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized expm1f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN8v_expm1f +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_expm1f, __GI__ZGVdN8v_expm1f, + __redirect__ZGVdN8v_expm1f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core_avx2.S new file mode 100644 index 0000000000..8e65d692d6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_expm1f8_core_avx2.S @@ -0,0 +1,351 @@ +/* Function expm1f vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * N = (int)(x*2^k/log(2.0)), R = x - N*log(2)/2^k + * exp(x) = 2^(N/2^k) * poly(R) is computed in high-low parts + * expm1(x) = exp(x)-1 is then obtained via multi-precision computation + * + * + */ + +/* Offsets for data table __svml_sexpm1_data_internal + */ +#define Expm1_HA_table 0 +#define poly_coeff 512 +#define Log2e 640 +#define L2H 672 +#define L2L 704 +#define ExpAddConst 736 +#define IndexMask 768 +#define ExpMask 800 +#define MOne 832 +#define AbsMask 864 +#define Threshold 896 +#define L2 928 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_expm1f_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea __svml_sexpm1_data_internal(%rip), %rax + vmovaps %ymm0, %ymm3 + vmulps Log2e+__svml_sexpm1_data_internal(%rip), %ymm3, %ymm4 + +/* argument reduction */ + vmovups L2H+__svml_sexpm1_data_internal(%rip), %ymm2 + vmovups AbsMask+__svml_sexpm1_data_internal(%rip), %ymm5 + vroundps $0, %ymm4, %ymm8 + vaddps ExpAddConst+__svml_sexpm1_data_internal(%rip), %ymm8, %ymm0 + vfnmadd213ps %ymm3, %ymm8, %ymm2 + +/* table lookup */ + vandps IndexMask+__svml_sexpm1_data_internal(%rip), %ymm0, %ymm9 + vandps %ymm5, %ymm3, %ymm6 + vcmpnle_uqps Threshold+__svml_sexpm1_data_internal(%rip), %ymm6, %ymm7 + vfnmadd231ps L2L+__svml_sexpm1_data_internal(%rip), %ymm8, %ymm2 + vandps ExpMask+__svml_sexpm1_data_internal(%rip), %ymm0, %ymm0 + vandnps %ymm3, %ymm5, %ymm1 + vpslld $14, %ymm0, %ymm0 + vmovmskps %ymm7, %edx + vmovd %xmm9, %ecx + vextractf128 $1, %ymm9, %xmm10 + movslq %ecx, %rcx + vmovd %xmm10, %r9d + vpextrd $1, %xmm9, %esi + vpextrd $2, %xmm9, %edi + vpextrd $3, %xmm9, %r8d + vmovq (%rax,%rcx), %xmm11 + vpextrd $1, %xmm10, %r10d + vpextrd $2, %xmm10, %r11d + vpextrd $3, %xmm10, %ecx + movslq %esi, %rsi + movslq %edi, %rdi + movslq %r8d, %r8 + movslq %r9d, %r9 + movslq %r10d, %r10 + movslq %r11d, %r11 + movslq 
%ecx, %rcx + vmovq (%rax,%rsi), %xmm13 + vmovq (%rax,%rdi), %xmm12 + vmovq (%rax,%r8), %xmm14 + vmovq (%rax,%r9), %xmm15 + vmovq (%rax,%r10), %xmm5 + vmovq (%rax,%r11), %xmm4 + vmovq (%rax,%rcx), %xmm6 + vunpcklps %xmm12, %xmm11, %xmm7 + vunpcklps %xmm14, %xmm13, %xmm8 + vunpcklps %xmm4, %xmm15, %xmm15 + vunpcklps %xmm6, %xmm5, %xmm9 + vmulps %ymm2, %ymm2, %ymm13 + vinsertf128 $1, %xmm15, %ymm7, %ymm10 + vinsertf128 $1, %xmm9, %ymm8, %ymm11 + vunpcklps %ymm11, %ymm10, %ymm12 + vorps %ymm0, %ymm12, %ymm14 + +/* polynomial */ + vmovups poly_coeff+__svml_sexpm1_data_internal(%rip), %ymm12 + vfmadd213ps poly_coeff+32+__svml_sexpm1_data_internal(%rip), %ymm2, %ymm12 + vfmadd213ps %ymm2, %ymm13, %ymm12 + +/* T-1 */ + vmovups MOne+__svml_sexpm1_data_internal(%rip), %ymm13 + vaddps %ymm13, %ymm14, %ymm2 + vunpckhps %ymm11, %ymm10, %ymm4 + vfmadd213ps %ymm2, %ymm0, %ymm4 + vsubps %ymm13, %ymm4, %ymm0 + vfmadd213ps %ymm4, %ymm12, %ymm0 + vorps %ymm1, %ymm0, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm3, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 
0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call expm1f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d
+END(_ZGVdN8v_expm1f_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_sexpm1_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Expm1_HA_table[(1<<7)][1]; + __declspec(align(32)) VUINT32 poly_coeff[4][8][1]; + __declspec(align(32)) VUINT32 Log2e[8][1]; + __declspec(align(32)) VUINT32 L2H[8][1]; + __declspec(align(32)) VUINT32 L2L[8][1]; + __declspec(align(32)) VUINT32 ExpAddConst[8][1]; + __declspec(align(32)) VUINT32 IndexMask[8][1]; + __declspec(align(32)) VUINT32 ExpMask[8][1]; + __declspec(align(32)) VUINT32 MOne[8][1]; + __declspec(align(32)) VUINT32 AbsMask[8][1]; + __declspec(align(32)) VUINT32 Threshold[8][1]; + __declspec(align(32)) VUINT32 L2[8][1]; +} __svml_sexpm1_data_internal; +#endif +__svml_sexpm1_data_internal: + /* Expm1_HA_table */ + .long 0x00000000, 0x00000000 + .long 0x00016000, 0x391a3e78 + .long 0x0002d000, 0xb89e59d5 + .long 0x00044000, 0xb93ae78a + .long 0x0005b000, 0xb9279306 + .long 0x00072000, 0xb79e6961 + .long 0x0008a000, 0xb97e2fee + .long 0x000a1000, 0x391aaea9 + .long 0x000b9000, 0x39383c7d + .long 0x000d2000, 0xb9241490 + .long 0x000ea000, 0x39073169 + .long 0x00103000, 0x386e218a + .long 0x0011c000, 0x38f4dceb + .long 0x00136000, 0xb93a9a1e + .long 0x0014f000, 0x391df520 + .long 0x00169000, 0x3905a6e4 + .long 0x00183000, 0x397e0a32 + .long 0x0019e000, 0x370b2641 + .long 0x001b9000, 0xb8b1918b + .long 0x001d4000, 0xb8132c6a + .long 0x001ef000, 0x39264c12 + .long 0x0020b000, 0x37221f73 + .long 0x00227000, 0x37060619 + .long 0x00243000, 0x3922b5c1 + .long 0x00260000, 0xb814ab27 + .long 0x0027d000, 0xb89b12c6 + .long 0x0029a000, 0x382d5a75 + .long 0x002b8000, 0xb938c94b + .long 0x002d6000, 0xb97822b8 + .long 0x002f4000, 0xb910ea53 + .long 0x00312000, 0x38fd6075 + .long 0x00331000, 0x38620955 + .long 0x00350000, 0x391e667f + .long 0x00370000, 0xb89b8736 + .long 0x00390000, 0xb90a1714 + .long 0x003b0000, 0xb7a54ded + .long 0x003d1000, 0xb96b8c15 + .long 0x003f1000, 
0x397336cf + .long 0x00413000, 0xb8eccd66 + .long 0x00434000, 0x39599b45 + .long 0x00456000, 0x3965422b + .long 0x00479000, 0xb8a2cdd5 + .long 0x0049c000, 0xb9484f32 + .long 0x004bf000, 0xb8fac043 + .long 0x004e2000, 0x391182a4 + .long 0x00506000, 0x38ccf6bc + .long 0x0052b000, 0xb97c4dc2 + .long 0x0054f000, 0x38d6aaf4 + .long 0x00574000, 0x391f995b + .long 0x0059a000, 0xb8ba8f62 + .long 0x005c0000, 0xb9090d05 + .long 0x005e6000, 0x37f4825e + .long 0x0060d000, 0xb8c844f5 + .long 0x00634000, 0xb76d1a83 + .long 0x0065c000, 0xb95f2310 + .long 0x00684000, 0xb952b5f8 + .long 0x006ac000, 0x37c6e7dd + .long 0x006d5000, 0xb7cfe126 + .long 0x006fe000, 0x3917337c + .long 0x00728000, 0x383b9e2d + .long 0x00752000, 0x392fa2a5 + .long 0x0077d000, 0x37df730b + .long 0x007a8000, 0x38ecb6dd + .long 0x007d4000, 0xb879f986 + /*== poly_coeff[4] ==*/ + .align 32 + .long 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF, 0x3e2AAABF /* coeff3 */ + .long 0x3f00000F, 0x3f00000F, 0x3f00000F, 0x3f00000F, 0x3f00000F, 0x3f00000F, 0x3f00000F, 0x3f00000F /* coeff2 */ + /* 64 Byte Padding */ + .zero 64 + /*== Log2e ==*/ + .align 32 + .long 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B, 0x42B8AA3B + /*== L2H ==*/ + .align 32 + .long 0x3c318000, 0x3c318000, 0x3c318000, 0x3c318000, 0x3c318000, 0x3c318000, 0x3c318000, 0x3c318000 + /*== L2L ==*/ + .align 32 + .long 0xb65e8083, 0xb65e8083, 0xb65e8083, 0xb65e8083, 0xb65e8083, 0xb65e8083, 0xb65e8083, 0xb65e8083 + /*== ExpAddConst ==*/ + .align 32 + .long 0x49f0fe00, 0x49f0fe00, 0x49f0fe00, 0x49f0fe00, 0x49f0fe00, 0x49f0fe00, 0x49f0fe00, 0x49f0fe00 + /*== IndexMask ==*/ + .align 32 + .long 0x000001f8, 0x000001f8, 0x000001f8, 0x000001f8, 0x000001f8, 0x000001f8, 0x000001f8, 0x000001f8 + /*== ExpMask ==*/ + .align 32 + .long 0x0001fe00, 0x0001fe00, 0x0001fe00, 0x0001fe00, 0x0001fe00, 0x0001fe00, 0x0001fe00, 0x0001fe00 + /*== MOne ==*/ + .align 32 + .long 0xbf800000, 0xbf800000, 
0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000 + /*== AbsMask ==*/ + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== Threshold ==*/ + .align 32 + .long 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B, 0x42AD496B // 86.643394 + /*== L2 ==*/ + .align 32 + .long 0x3cb17218, 0x3cb17218, 0x3cb17218, 0x3cb17218, 0x3cb17218, 0x3cb17218, 0x3cb17218, 0x3cb17218 + .align 32 + .type __svml_sexpm1_data_internal,@object + .size __svml_sexpm1_data_internal,.-__svml_sexpm1_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_expm12_core.S b/sysdeps/x86_64/fpu/svml_d_expm12_core.S new file mode 100644 index 0000000000..a725d614bd --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_expm12_core.S @@ -0,0 +1,29 @@ +/* Function expm1 vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_expm1) +WRAPPER_IMPL_SSE2 expm1 +END (_ZGVbN2v_expm1) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_expm1) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_expm14_core.S b/sysdeps/x86_64/fpu/svml_d_expm14_core.S new file mode 100644 index 0000000000..1027def883 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_expm14_core.S @@ -0,0 +1,29 @@ +/* Function expm1 vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_expm1) +WRAPPER_IMPL_AVX _ZGVbN2v_expm1 +END (_ZGVdN4v_expm1) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_expm1) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_expm14_core_avx.S b/sysdeps/x86_64/fpu/svml_d_expm14_core_avx.S new file mode 100644 index 0000000000..3a34262241 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_expm14_core_avx.S @@ -0,0 +1,25 @@ +/* Function expm1 vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_expm1) +WRAPPER_IMPL_AVX _ZGVbN2v_expm1 +END (_ZGVcN4v_expm1) diff --git a/sysdeps/x86_64/fpu/svml_d_expm18_core.S b/sysdeps/x86_64/fpu/svml_d_expm18_core.S new file mode 100644 index 0000000000..fa97595665 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_expm18_core.S @@ -0,0 +1,25 @@ +/* Function expm1 vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_expm1) +WRAPPER_IMPL_AVX512 _ZGVdN4v_expm1 +END (_ZGVeN8v_expm1) diff --git a/sysdeps/x86_64/fpu/svml_s_expm1f16_core.S b/sysdeps/x86_64/fpu/svml_s_expm1f16_core.S new file mode 100644 index 0000000000..b7423632a9 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_expm1f16_core.S @@ -0,0 +1,25 @@ +/* Function expm1f vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_expm1f) +WRAPPER_IMPL_AVX512 _ZGVdN8v_expm1f +END (_ZGVeN16v_expm1f) diff --git a/sysdeps/x86_64/fpu/svml_s_expm1f4_core.S b/sysdeps/x86_64/fpu/svml_s_expm1f4_core.S new file mode 100644 index 0000000000..334a49133a --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_expm1f4_core.S @@ -0,0 +1,29 @@ +/* Function expm1f vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_expm1f) +WRAPPER_IMPL_SSE2 expm1f +END (_ZGVbN4v_expm1f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_expm1f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_expm1f8_core.S b/sysdeps/x86_64/fpu/svml_s_expm1f8_core.S new file mode 100644 index 0000000000..10589574a5 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_expm1f8_core.S @@ -0,0 +1,29 @@ +/* Function expm1f vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_expm1f) +WRAPPER_IMPL_AVX _ZGVbN4v_expm1f +END (_ZGVdN8v_expm1f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_expm1f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_expm1f8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_expm1f8_core_avx.S new file mode 100644 index 0000000000..4161113615 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_expm1f8_core_avx.S @@ -0,0 +1,25 @@ +/* Function expm1f vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_expm1f) +WRAPPER_IMPL_AVX _ZGVbN4v_expm1f +END (_ZGVcN8v_expm1f) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx.c new file mode 100644 index 0000000000..3e59cb7141 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-expm1.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx2.c new file mode 100644 index 0000000000..3e59cb7141 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-expm1.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx512f.c new file mode 100644 index 0000000000..3e59cb7141 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-expm1-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-expm1.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-expm1.c b/sysdeps/x86_64/fpu/test-double-libmvec-expm1.c new file mode 100644 index 0000000000..33806a78c8 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-expm1.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC expm1 +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 68c449e04a..0222f9f5b8 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVbN2vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVbN2v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVbN2v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVbN2v_cosh) +VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVbN2v_expm1) #define VEC_INT_TYPE __m128i diff --git
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index df67306373..1aad9faf9c 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVdN4vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVdN4v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVdN4v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVdN4v_cosh) +VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVdN4v_expm1) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 1a6731098f..e404bf899d 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVcN4vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVcN4v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVcN4v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVcN4v_cosh) +VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVcN4v_expm1) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 4cdfa918e8..2b4de59343 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypot), _ZGVeN8vv_hypot) VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVeN8v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVeN8v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVeN8v_cosh) +VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVeN8v_expm1) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx.c new file mode 100644 index 0000000000..67e31f9666 --- /dev/null +++ 
b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-expm1f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx2.c new file mode 100644 index 0000000000..67e31f9666 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-expm1f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx512f.c new file mode 100644 index 0000000000..67e31f9666 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-expm1f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-expm1f.c b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f.c new file mode 100644 index 0000000000..aa9871a39d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-expm1f.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC expm1f +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 47a9862233..9a4a1b84a9 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVeN16vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVeN16v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVeN16v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVeN16v_coshf) +VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVeN16v_expm1f) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index e7c5410e7b..eb4e36d0e2 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVbN4vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), 
_ZGVbN4v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVbN4v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVbN4v_coshf) +VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVbN4v_expm1f) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index b8e9d48cd6..d8adab59e6 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVdN8vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVdN8v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVdN8v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVdN8v_coshf) +VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVdN8v_expm1f) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 328c827b27..e6e1a90c72 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -34,6 +34,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (hypotf), _ZGVcN8vv_hypotf) VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVcN8v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVcN8v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVcN8v_coshf) +VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVcN8v_expm1f) #define VEC_INT_TYPE __m128i From patchwork Tue Dec 28 20:11:20 2021
To: libc-alpha@sourceware.org
Subject: [PATCH v4 08/18] x86-64: Add vector sinh/sinhf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:20 -0800
Message-Id: <20211228201130.737370-9-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
From: Sunil Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com

Implement vectorized sinh/sinhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector sinh/sinhf with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_sinh2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_sinh2_core.c | 27 + .../fpu/multiarch/svml_d_sinh2_core_sse4.S | 456 +++++++++++++++++ .../fpu/multiarch/svml_d_sinh4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_sinh4_core.c | 27 + .../fpu/multiarch/svml_d_sinh4_core_avx2.S | 470 ++++++++++++++++++ .../fpu/multiarch/svml_d_sinh8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_sinh8_core.c | 27 + .../fpu/multiarch/svml_d_sinh8_core_avx512.S | 461 +++++++++++++++++ .../fpu/multiarch/svml_s_sinhf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_sinhf16_core.c | 28 ++ .../multiarch/svml_s_sinhf16_core_avx512.S | 318 ++++++++++++ .../fpu/multiarch/svml_s_sinhf4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_sinhf4_core.c | 28 ++ .../fpu/multiarch/svml_s_sinhf4_core_sse4.S | 308 ++++++++++++ .../fpu/multiarch/svml_s_sinhf8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_sinhf8_core.c | 28 ++ .../fpu/multiarch/svml_s_sinhf8_core_avx2.S | 309 ++++++++++++ sysdeps/x86_64/fpu/svml_d_sinh2_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_sinh4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_sinh4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_sinh8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_sinhf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_sinhf4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_sinhf8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_sinhf8_core_avx.S | 25 + .../x86_64/fpu/test-double-libmvec-sinh-avx.c | 1 + .../fpu/test-double-libmvec-sinh-avx2.c | 1 + .../fpu/test-double-libmvec-sinh-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-sinh.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c 
| 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-sinhf-avx.c | 1 + .../fpu/test-float-libmvec-sinhf-avx2.c | 1 + .../fpu/test-float-libmvec-sinhf-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-sinhf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 2894 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_sinh2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_sinh4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_sinh4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_sinh8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_sinhf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_sinhf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_sinhf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_sinhf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-sinh.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-sinhf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 28dc4a82c5..6347320521 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -186,4 +186,15 @@ #define __DECL_SIMD_expm1f32x #define __DECL_SIMD_expm1f64x #define __DECL_SIMD_expm1f128x + +#define __DECL_SIMD_sinh +#define __DECL_SIMD_sinhf +#define __DECL_SIMD_sinhl +#define __DECL_SIMD_sinhf16 +#define __DECL_SIMD_sinhf32 +#define __DECL_SIMD_sinhf64 +#define __DECL_SIMD_sinhf128 +#define __DECL_SIMD_sinhf32x +#define __DECL_SIMD_sinhf64x +#define __DECL_SIMD_sinhf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index c57adc8ace..673b3a93ba 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -70,7 +70,7 @@ __MATHCALL (tan,, (_Mdouble_ __x)); /* Hyperbolic cosine of X. */ __MATHCALL_VEC (cosh,, (_Mdouble_ __x)); /* Hyperbolic sine of X. */ -__MATHCALL (sinh,, (_Mdouble_ __x)); +__MATHCALL_VEC (sinh,, (_Mdouble_ __x)); /* Hyperbolic tangent of X. 
*/ __MATHCALL (tanh,, (_Mdouble_ __x)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index c9d3213bd3..f9d7b085ab 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -53,6 +53,7 @@ GLIBC_2.35 _ZGVbN2v_cosh F GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2v_expm1 F +GLIBC_2.35 _ZGVbN2v_sinh F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F @@ -61,6 +62,7 @@ GLIBC_2.35 _ZGVbN4v_coshf F GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4v_expm1f F +GLIBC_2.35 _ZGVbN4v_sinhf F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F @@ -69,6 +71,7 @@ GLIBC_2.35 _ZGVcN4v_cosh F GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4v_expm1 F +GLIBC_2.35 _ZGVcN4v_sinh F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F @@ -77,6 +80,7 @@ GLIBC_2.35 _ZGVcN8v_coshf F GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8v_expm1f F +GLIBC_2.35 _ZGVcN8v_sinhf F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F @@ -85,6 +89,7 @@ GLIBC_2.35 _ZGVdN4v_cosh F GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4v_expm1 F +GLIBC_2.35 _ZGVdN4v_sinh F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F @@ -93,6 +98,7 @@ GLIBC_2.35 _ZGVdN8v_coshf F GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8v_expm1f F +GLIBC_2.35 _ZGVdN8v_sinhf F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F @@ -101,6 +107,7 @@ GLIBC_2.35 _ZGVeN16v_coshf F GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16v_expm1f F +GLIBC_2.35 _ZGVeN16v_sinhf F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 
_ZGVeN8v_asin F @@ -109,4 +116,5 @@ GLIBC_2.35 _ZGVeN8v_cosh F GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8v_expm1 F +GLIBC_2.35 _ZGVeN8v_sinh F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index e2f98e176f..51a41cfebc 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -90,6 +90,10 @@ # define __DECL_SIMD_expm1 __DECL_SIMD_x86_64 # undef __DECL_SIMD_expm1f # define __DECL_SIMD_expm1f __DECL_SIMD_x86_64 +# undef __DECL_SIMD_sinh +# define __DECL_SIMD_sinh __DECL_SIMD_x86_64 +# undef __DECL_SIMD_sinhf +# define __DECL_SIMD_sinhf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 43233059f6..91e9b4fc83 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -44,6 +44,8 @@ !GCC$ builtin (coshf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (expm1) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (expm1f) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (sinh) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (sinhf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -73,3 +75,5 @@ !GCC$ builtin (coshf) attributes simd (notinbranch) if('x32') !GCC$ builtin (expm1) attributes simd (notinbranch) if('x32') !GCC$ builtin (expm1f) attributes simd (notinbranch) if('x32') +!GCC$ builtin (sinh) attributes simd (notinbranch) if('x32') +!GCC$ builtin (sinhf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 8de8214971..81e9fc95b2 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -36,6 +36,7 @@ libmvec-funcs = \ pow \ sin \ sincos \ + sinh 
\ # Define libmvec function for benchtests directory. libmvec-bench-funcs = \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 58debb2dbe..2710446d12 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -21,6 +21,7 @@ libmvec { _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; + _ZGVbN2v_sinh; _ZGVcN4v_sinh; _ZGVdN4v_sinh; _ZGVeN8v_sinh; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; @@ -29,6 +30,7 @@ libmvec { _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; + _ZGVbN4v_sinhf; _ZGVcN8v_sinhf; _ZGVdN8v_sinhf; _ZGVeN16v_sinhf; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index f05ece8c8a..f4b313119d 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1840,6 +1840,26 @@ float: 3 float128: 4 ldouble: 5 +Function: "sinh_vlen16": +float: 1 + +Function: "sinh_vlen2": +double: 2 + +Function: "sinh_vlen4": +double: 2 +float: 1 + +Function: "sinh_vlen4_avx2": +double: 2 + +Function: "sinh_vlen8": +double: 2 +float: 1 + +Function: "sinh_vlen8_avx2": +float: 1 + Function: "tan": float: 1 float128: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core-sse2.S new file mode 100644 index 0000000000..ca12ad6678 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized sinh, vector length is 2. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVbN2v_sinh _ZGVbN2v_sinh_sse2 +#include "../svml_d_sinh2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core.c new file mode 100644 index 0000000000..c0344b2902 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized sinh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVbN2v_sinh +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_sinh, __GI__ZGVbN2v_sinh, __redirect__ZGVbN2v_sinh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core_sse4.S new file mode 100644 index 0000000000..80d19e9dba --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh2_core_sse4.S @@ -0,0 +1,456 @@ +/* Function sinh vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute sinh(x) as (exp(x)-exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * sinh(NaN) = quiet NaN, and raise invalid exception + * sinh(INF) = that INF + * sinh(x) = x for subnormals + * sinh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_dsinh_data_internal + */ +#define _dbInvLn2 0 +#define _dbLn2hi 16 +#define _dbLn2lo 32 +#define _dSign 48 +#define _dbT 64 +#define _dbShifter 2112 +#define _iDomainRange 2128 +#define _dPC2 2144 +#define _dPC3 2160 +#define _dPC4 2176 +#define _dPC5 2192 +#define _lIndexMask 2208 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_sinh_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm2 + +/* Abs argument */ + movups _dSign+__svml_dsinh_data_internal(%rip), %xmm0 + lea _dbT+8+__svml_dsinh_data_internal(%rip), %rsi + andps %xmm2, %xmm0 + movaps %xmm0, %xmm1 + +/* + * Load argument + * dM = x*2^K/log(2) + RShifter + */ + movups _dbInvLn2+__svml_dsinh_data_internal(%rip), %xmm10 + pxor %xmm2, %xmm1 + mulpd %xmm1, %xmm10 + movups _dbShifter+__svml_dsinh_data_internal(%rip), %xmm5 + addpd %xmm5, %xmm10 + +/* + * R + * dN = dM - RShifter + */ + movaps %xmm10, %xmm7 + subpd %xmm5, %xmm7 + +/* dR = dX - dN*Log2_hi/2^K */ + movups _dbLn2hi+__svml_dsinh_data_internal(%rip), %xmm6 + mulpd %xmm7, %xmm6 + +/* dR = (dX - dN*Log2_hi/2^K) - dN*Log2_lo/2^K */ + movups _dbLn2lo+__svml_dsinh_data_internal(%rip), %xmm8 + mulpd %xmm7, %xmm8 + +/* + * Check for overflow/underflow + * + */ + pshufd $221, %xmm1, %xmm4 + subpd %xmm6, %xmm1 + subpd %xmm8, %xmm1 + +/* VLOAD_CONST( D, dPC[0], TAB._dPC1 ); */ + movq _iDomainRange+__svml_dsinh_data_internal(%rip), %xmm3 + pcmpgtd %xmm3, %xmm4 + +/* dR2 = dR^2 */ + movaps %xmm1, %xmm3 + mulpd %xmm1, %xmm3 + movmskps %xmm4, %edx + +/* + * sinh(r) = r*((a1=1)+r^2*(a3+r^2*a5)) = r +
r*(r^2*(a3+r^2*a5)) .... + * dSinh_r = (a3+r^2*a5) + */ + movups _dPC5+__svml_dsinh_data_internal(%rip), %xmm12 + +/* + * poly(r) = (dG2+dG1)+dG3*sinh(dR)+dG1*sinh(dR)+(dG1+dG2)*dR2*(a2 +a4*dR2) + * dOut = (a2 +a4*dR2) + */ + movups _dPC4+__svml_dsinh_data_internal(%rip), %xmm13 + mulpd %xmm3, %xmm12 + mulpd %xmm3, %xmm13 + addpd _dPC3+__svml_dsinh_data_internal(%rip), %xmm12 + addpd _dPC2+__svml_dsinh_data_internal(%rip), %xmm13 + +/* dSinh_r = r^2*(a3+r^2*a5) */ + mulpd %xmm3, %xmm12 + +/* dOut = dR2*(a2 +a4*dR2) */ + mulpd %xmm13, %xmm3 + +/* dSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + mulpd %xmm1, %xmm12 + +/* + * Index and lookup + * j + */ + movups _lIndexMask+__svml_dsinh_data_internal(%rip), %xmm9 + andps %xmm10, %xmm9 + movd %xmm9, %eax + +/* split j and N */ + pxor %xmm9, %xmm10 + +/* + * G1,G2,G3: dTdif,dTn * 2^N,2^(-N) + * lM now is an EXP(2^N) + */ + psllq $45, %xmm10 + +/* */ + movaps %xmm10, %xmm4 + pextrw $4, %xmm9, %ecx + addpd %xmm12, %xmm1 + shll $4, %eax + shll $4, %ecx + movq (%rax,%rsi), %xmm11 + movhpd (%rcx,%rsi), %xmm11 + paddq %xmm11, %xmm4 + +/* */ + psubq %xmm10, %xmm11 + +/* dG3 = dTn*2^N + dTn*2^-N */ + movdqa %xmm4, %xmm14 + addpd %xmm11, %xmm14 + +/* dG2 = dTn*2^N - dTn*2^-N */ + subpd %xmm11, %xmm4 + movq -8(%rax,%rsi), %xmm15 + movhpd -8(%rcx,%rsi), %xmm15 + paddq %xmm10, %xmm15 + +/* dG2 += dG1 */ + addpd %xmm15, %xmm4 + +/* dG1 += dG3 */ + addpd %xmm14, %xmm15 + +/* dOut = dG2*dR2*(a2 +a4*dR2) */ + mulpd %xmm4, %xmm3 + +/* dOut = dG1*sinh(dR)+dG2*dR2*(a2 +a4*dR2) */ + mulpd %xmm15, %xmm1 + addpd %xmm1, %xmm3 + +/* dOut = dG2 + dG1*sinh(dR)+dG2*dR2*(a2 +a4*dR2) */ + addpd %xmm3, %xmm4 + +/* Ret H */ + orps %xmm4, %xmm0 + andl $3, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs 
+ */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm2, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call sinh@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_sinh_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dsinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _dbInvLn2[2][2]; + __declspec(align(16)) VUINT32 _dbLn2hi[2][2]; + __declspec(align(16)) VUINT32 _dbLn2lo[2][2]; + __declspec(align(16)) VUINT32 _dSign[2][2]; //0x8000000000000000 + __declspec(align(16)) VUINT32 _dbT[(1<<7)][2][2]; //precalc poly coeff + __declspec(align(16)) VUINT32 _dbShifter[2][2]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; + __declspec(align(16)) VUINT32 _dPC2[2][2]; + __declspec(align(16)) VUINT32 _dPC3[2][2]; + __declspec(align(16)) VUINT32 _dPC4[2][2]; +
__declspec(align(16)) VUINT32 _dPC5[2][2]; + __declspec(align(16)) VUINT32 _lIndexMask[2][2]; +} __svml_dsinh_data_internal; +#endif +__svml_dsinh_data_internal: + .quad 0x3FF71547652B82FE, 0x3FF71547652B82FE /* _dbInvLn2 = 1/log(2) */ + .align 16 + .quad 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000 /* _dbLn2hi = log(2) hi*/ + .align 16 + .quad 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A /* _dbLn2lo = log(2) lo*/ + .align 16 + .quad 0x8000000000000000, 0x8000000000000000 /* _dSign */ + //_dbT + .align 16 + .quad 0x0000000000000000, 0x3FE0000000000000 //2^( 0 /128-1) - 2^(- 0 /128-1), 2^(- 0 /128-1) + .quad 0x3F762E4A19BD1E74, 0x3FDFD3C22B8F71F1 //2^( 1 /128-1) - 2^(- 1 /128-1), 2^(- 1 /128-1) + .quad 0x3F862E5F6A0DFD36, 0x3FDFA7C1819E90D8 //2^( 2 /128-1) - 2^(- 2 /128-1), 2^(- 2 /128-1) + .quad 0x3F90A2E234040F5F, 0x3FDF7BFDAD9CBE14 //2^( 3 /128-1) - 2^(- 3 /128-1), 2^(- 3 /128-1) + .quad 0x3F962EB4ABCC5A81, 0x3FDF50765B6E4540 //2^( 4 /128-1) - 2^(- 4 /128-1), 2^(- 4 /128-1) + .quad 0x3F9BBAB1C5033244, 0x3FDF252B376BBA97 //2^( 5 /128-1) - 2^(- 5 /128-1), 2^(- 5 /128-1) + .quad 0x3FA0A372144EEB45, 0x3FDEFA1BEE615A27 //2^( 6 /128-1) - 2^(- 6 /128-1), 2^(- 6 /128-1) + .quad 0x3FA369AB3FFBF8B0, 0x3FDECF482D8E67F1 //2^( 7 /128-1) - 2^(- 7 /128-1), 2^(- 7 /128-1) + .quad 0x3FA63009BA740A2A, 0x3FDEA4AFA2A490DA //2^( 8 /128-1) - 2^(- 8 /128-1), 2^(- 8 /128-1) + .quad 0x3FA8F692D8EA1B5A, 0x3FDE7A51FBC74C83 //2^( 9 /128-1) - 2^(- 9 /128-1), 2^(- 9 /128-1) + .quad 0x3FABBD4BF0E31A6F, 0x3FDE502EE78B3FF6 //2^( 10 /128-1) - 2^(- 10 /128-1), 2^(- 10 /128-1) + .quad 0x3FAE843A5840286A, 0x3FDE264614F5A129 //2^( 11 /128-1) - 2^(- 11 /128-1), 2^(- 11 /128-1) + .quad 0x3FB0A5B1B2A46D0A, 0x3FDDFC97337B9B5F //2^( 12 /128-1) - 2^(- 12 /128-1), 2^(- 12 /128-1) + .quad 0x3FB20966375ABCDF, 0x3FDDD321F301B460 //2^( 13 /128-1) - 2^(- 13 /128-1), 2^(- 13 /128-1) + .quad 0x3FB36D3D65DCA4E8, 0x3FDDA9E603DB3285 //2^( 14 /128-1) - 2^(- 14 /128-1), 2^(- 14 /128-1) + .quad 0x3FB4D139EA06642A, 
0x3FDD80E316C98398 //2^( 15 /128-1) - 2^(- 15 /128-1), 2^(- 15 /128-1) + .quad 0x3FB6355E6FFBF9BA, 0x3FDD5818DCFBA487 //2^( 16 /128-1) - 2^(- 16 /128-1), 2^(- 16 /128-1) + .quad 0x3FB799ADA42E4788, 0x3FDD2F87080D89F2 //2^( 17 /128-1) - 2^(- 17 /128-1), 2^(- 17 /128-1) + .quad 0x3FB8FE2A336035BC, 0x3FDD072D4A07897C //2^( 18 /128-1) - 2^(- 18 /128-1), 2^(- 18 /128-1) + .quad 0x3FBA62D6CAABD6B6, 0x3FDCDF0B555DC3FA //2^( 19 /128-1) - 2^(- 19 /128-1), 2^(- 19 /128-1) + .quad 0x3FBBC7B617878BAF, 0x3FDCB720DCEF9069 //2^( 20 /128-1) - 2^(- 20 /128-1), 2^(- 20 /128-1) + .quad 0x3FBD2CCAC7CB2A11, 0x3FDC8F6D9406E7B5 //2^( 21 /128-1) - 2^(- 21 /128-1), 2^(- 21 /128-1) + .quad 0x3FBE921789B52185, 0x3FDC67F12E57D14B //2^( 22 /128-1) - 2^(- 22 /128-1), 2^(- 22 /128-1) + .quad 0x3FBFF79F0BEFA2C7, 0x3FDC40AB5FFFD07A //2^( 23 /128-1) - 2^(- 23 /128-1), 2^(- 23 /128-1) + .quad 0x3FC0AEB1FECAE3A9, 0x3FDC199BDD85529C //2^( 24 /128-1) - 2^(- 24 /128-1), 2^(- 24 /128-1) + .quad 0x3FC161B4871C5CEC, 0x3FDBF2C25BD71E09 //2^( 25 /128-1) - 2^(- 25 /128-1), 2^(- 25 /128-1) + .quad 0x3FC214D876F26FD0, 0x3FDBCC1E904BC1D2 //2^( 26 /128-1) - 2^(- 26 /128-1), 2^(- 26 /128-1) + .quad 0x3FC2C81F2693816F, 0x3FDBA5B030A1064A //2^( 27 /128-1) - 2^(- 27 /128-1), 2^(- 27 /128-1) + .quad 0x3FC37B89EE88BEF7, 0x3FDB7F76F2FB5E47 //2^( 28 /128-1) - 2^(- 28 /128-1), 2^(- 28 /128-1) + .quad 0x3FC42F1A27A0B3CD, 0x3FDB59728DE5593A //2^( 29 /128-1) - 2^(- 29 /128-1), 2^(- 29 /128-1) + .quad 0x3FC4E2D12AF1E037, 0x3FDB33A2B84F15FB //2^( 30 /128-1) - 2^(- 30 /128-1), 2^(- 30 /128-1) + .quad 0x3FC596B051DD508D, 0x3FDB0E07298DB666 //2^( 31 /128-1) - 2^(- 31 /128-1), 2^(- 31 /128-1) + .quad 0x3FC64AB8F61134FA, 0x3FDAE89F995AD3AD //2^( 32 /128-1) - 2^(- 32 /128-1), 2^(- 32 /128-1) + .quad 0x3FC6FEEC718B79D1, 0x3FDAC36BBFD3F37A //2^( 33 /128-1) - 2^(- 33 /128-1), 2^(- 33 /128-1) + .quad 0x3FC7B34C1E9C607F, 0x3FDA9E6B5579FDBF //2^( 34 /128-1) - 2^(- 34 /128-1), 2^(- 34 /128-1) + .quad 0x3FC867D957E91912, 0x3FDA799E1330B358 
//2^( 35 /128-1) - 2^(- 35 /128-1), 2^(- 35 /128-1) + .quad 0x3FC91C95786E5C72, 0x3FDA5503B23E255D //2^( 36 /128-1) - 2^(- 36 /128-1), 2^(- 36 /128-1) + .quad 0x3FC9D181DB83072F, 0x3FDA309BEC4A2D33 //2^( 37 /128-1) - 2^(- 37 /128-1), 2^(- 37 /128-1) + .quad 0x3FCA869FDCDAB512, 0x3FDA0C667B5DE565 //2^( 38 /128-1) - 2^(- 38 /128-1), 2^(- 38 /128-1) + .quad 0x3FCB3BF0D8885D4C, 0x3FD9E86319E32323 //2^( 39 /128-1) - 2^(- 39 /128-1), 2^(- 39 /128-1) + .quad 0x3FCBF1762B00EF69, 0x3FD9C49182A3F090 //2^( 40 /128-1) - 2^(- 40 /128-1), 2^(- 40 /128-1) + .quad 0x3FCCA731311DF0FB, 0x3FD9A0F170CA07BA //2^( 41 /128-1) - 2^(- 41 /128-1), 2^(- 41 /128-1) + .quad 0x3FCD5D2348201C09, 0x3FD97D829FDE4E50 //2^( 42 /128-1) - 2^(- 42 /128-1), 2^(- 42 /128-1) + .quad 0x3FCE134DCDB1FE3E, 0x3FD95A44CBC8520F //2^( 43 /128-1) - 2^(- 43 /128-1), 2^(- 43 /128-1) + .quad 0x3FCEC9B21FEA98EA, 0x3FD93737B0CDC5E5 //2^( 44 /128-1) - 2^(- 44 /128-1), 2^(- 44 /128-1) + .quad 0x3FCF80519D5001D3, 0x3FD9145B0B91FFC6 //2^( 45 /128-1) - 2^(- 45 /128-1), 2^(- 45 /128-1) + .quad 0x3FD01B96D26D026A, 0x3FD8F1AE99157736 //2^( 46 /128-1) - 2^(- 46 /128-1), 2^(- 46 /128-1) + .quad 0x3FD07723CAFA6331, 0x3FD8CF3216B5448C //2^( 47 /128-1) - 2^(- 47 /128-1), 2^(- 47 /128-1) + .quad 0x3FD0D2D06841B373, 0x3FD8ACE5422AA0DB //2^( 48 /128-1) - 2^(- 48 /128-1), 2^(- 48 /128-1) + .quad 0x3FD12E9D5A715381, 0x3FD88AC7D98A6699 //2^( 49 /128-1) - 2^(- 49 /128-1), 2^(- 49 /128-1) + .quad 0x3FD18A8B51F5C661, 0x3FD868D99B4492ED //2^( 50 /128-1) - 2^(- 50 /128-1), 2^(- 50 /128-1) + .quad 0x3FD1E69AFF7B04D7, 0x3FD8471A4623C7AD //2^( 51 /128-1) - 2^(- 51 /128-1), 2^(- 51 /128-1) + .quad 0x3FD242CD13EDD0F1, 0x3FD82589994CCE13 //2^( 52 /128-1) - 2^(- 52 /128-1), 2^(- 52 /128-1) + .quad 0x3FD29F22407D0A0C, 0x3FD80427543E1A12 //2^( 53 /128-1) - 2^(- 53 /128-1), 2^(- 53 /128-1) + .quad 0x3FD2FB9B369B0153, 0x3FD7E2F336CF4E62 //2^( 54 /128-1) - 2^(- 54 /128-1), 2^(- 54 /128-1) + .quad 0x3FD35838A7FECEC8, 0x3FD7C1ED0130C132 //2^( 55 /128-1) - 
2^(- 55 /128-1), 2^(- 55 /128-1) + .quad 0x3FD3B4FB46A5A6CC, 0x3FD7A11473EB0187 //2^( 56 /128-1) - 2^(- 56 /128-1), 2^(- 56 /128-1) + .quad 0x3FD411E3C4D4302F, 0x3FD780694FDE5D3F //2^( 57 /128-1) - 2^(- 57 /128-1), 2^(- 57 /128-1) + .quad 0x3FD46EF2D517DAC8, 0x3FD75FEB564267C9 //2^( 58 /128-1) - 2^(- 58 /128-1), 2^(- 58 /128-1) + .quad 0x3FD4CC292A48369E, 0x3FD73F9A48A58174 //2^( 59 /128-1) - 2^(- 59 /128-1), 2^(- 59 /128-1) + .quad 0x3FD5298777884B96, 0x3FD71F75E8EC5F74 //2^( 60 /128-1) - 2^(- 60 /128-1), 2^(- 60 /128-1) + .quad 0x3FD5870E7047F1BC, 0x3FD6FF7DF9519484 //2^( 61 /128-1) - 2^(- 61 /128-1), 2^(- 61 /128-1) + .quad 0x3FD5E4BEC8452A1A, 0x3FD6DFB23C651A2F //2^( 62 /128-1) - 2^(- 62 /128-1), 2^(- 62 /128-1) + .quad 0x3FD64299338D7827, 0x3FD6C012750BDABF //2^( 63 /128-1) - 2^(- 63 /128-1), 2^(- 63 /128-1) + .quad 0x3FD6A09E667F3BCD, 0x3FD6A09E667F3BCD //2^( 64 /128-1) - 2^(- 64 /128-1), 2^(- 64 /128-1) + .quad 0x3FD6FECF15CB0C0B, 0x3FD68155D44CA973 //2^( 65 /128-1) - 2^(- 65 /128-1), 2^(- 65 /128-1) + .quad 0x3FD75D2BF6751239, 0x3FD6623882552225 //2^( 66 /128-1) - 2^(- 66 /128-1), 2^(- 66 /128-1) + .quad 0x3FD7BBB5BDD665E8, 0x3FD6434634CCC320 //2^( 67 /128-1) - 2^(- 67 /128-1), 2^(- 67 /128-1) + .quad 0x3FD81A6D219E6963, 0x3FD6247EB03A5585 //2^( 68 /128-1) - 2^(- 68 /128-1), 2^(- 68 /128-1) + .quad 0x3FD87952D7D426DF, 0x3FD605E1B976DC09 //2^( 69 /128-1) - 2^(- 69 /128-1), 2^(- 69 /128-1) + .quad 0x3FD8D86796D7AE49, 0x3FD5E76F15AD2148 //2^( 70 /128-1) - 2^(- 70 /128-1), 2^(- 70 /128-1) + .quad 0x3FD937AC156373C8, 0x3FD5C9268A5946B7 //2^( 71 /128-1) - 2^(- 71 /128-1), 2^(- 71 /128-1) + .quad 0x3FD997210A8DAEE4, 0x3FD5AB07DD485429 //2^( 72 /128-1) - 2^(- 72 /128-1), 2^(- 72 /128-1) + .quad 0x3FD9F6C72DC9BA68, 0x3FD58D12D497C7FD //2^( 73 /128-1) - 2^(- 73 /128-1), 2^(- 73 /128-1) + .quad 0x3FDA569F36E974EA, 0x3FD56F4736B527DA //2^( 74 /128-1) - 2^(- 74 /128-1), 2^(- 74 /128-1) + .quad 0x3FDAB6A9DE1EA215, 0x3FD551A4CA5D920F //2^( 75 /128-1) - 2^(- 75 /128-1), 
2^(- 75 /128-1) + .quad 0x3FDB16E7DBFC4CA3, 0x3FD5342B569D4F82 //2^( 76 /128-1) - 2^(- 76 /128-1), 2^(- 76 /128-1) + .quad 0x3FDB7759E9782918, 0x3FD516DAA2CF6642 //2^( 77 /128-1) - 2^(- 77 /128-1), 2^(- 77 /128-1) + .quad 0x3FDBD800BFEBF932, 0x3FD4F9B2769D2CA7 //2^( 78 /128-1) - 2^(- 78 /128-1), 2^(- 78 /128-1) + .quad 0x3FDC38DD1916F025, 0x3FD4DCB299FDDD0D //2^( 79 /128-1) - 2^(- 79 /128-1), 2^(- 79 /128-1) + .quad 0x3FDC99EFAF1F1790, 0x3FD4BFDAD5362A27 //2^( 80 /128-1) - 2^(- 80 /128-1), 2^(- 80 /128-1) + .quad 0x3FDCFB393C92B539, 0x3FD4A32AF0D7D3DE //2^( 81 /128-1) - 2^(- 81 /128-1), 2^(- 81 /128-1) + .quad 0x3FDD5CBA7C69B19C, 0x3FD486A2B5C13CD0 //2^( 82 /128-1) - 2^(- 82 /128-1), 2^(- 82 /128-1) + .quad 0x3FDDBE742A06FF34, 0x3FD46A41ED1D0057 //2^( 83 /128-1) - 2^(- 83 /128-1), 2^(- 83 /128-1) + .quad 0x3FDE2067013A029D, 0x3FD44E086061892D //2^( 84 /128-1) - 2^(- 84 /128-1), 2^(- 84 /128-1) + .quad 0x3FDE8293BE3FFB87, 0x3FD431F5D950A897 //2^( 85 /128-1) - 2^(- 85 /128-1), 2^(- 85 /128-1) + .quad 0x3FDEE4FB1DC56E75, 0x3FD4160A21F72E2A //2^( 86 /128-1) - 2^(- 86 /128-1), 2^(- 86 /128-1) + .quad 0x3FDF479DDCE78F58, 0x3FD3FA4504AC801C //2^( 87 /128-1) - 2^(- 87 /128-1), 2^(- 87 /128-1) + .quad 0x3FDFAA7CB935ACFE, 0x3FD3DEA64C123422 //2^( 88 /128-1) - 2^(- 88 /128-1), 2^(- 88 /128-1) + .quad 0x3FE006CC38594EB1, 0x3FD3C32DC313A8E5 //2^( 89 /128-1) - 2^(- 89 /128-1), 2^(- 89 /128-1) + .quad 0x3FE03878E0EB1569, 0x3FD3A7DB34E59FF7 //2^( 90 /128-1) - 2^(- 90 /128-1), 2^(- 90 /128-1) + .quad 0x3FE06A44B5C74101, 0x3FD38CAE6D05D866 //2^( 91 /128-1) - 2^(- 91 /128-1), 2^(- 91 /128-1) + .quad 0x3FE09C3016A0D077, 0x3FD371A7373AA9CB //2^( 92 /128-1) - 2^(- 92 /128-1), 2^(- 92 /128-1) + .quad 0x3FE0CE3B63676360, 0x3FD356C55F929FF1 //2^( 93 /128-1) - 2^(- 93 /128-1), 2^(- 93 /128-1) + .quad 0x3FE10066FC47F240, 0x3FD33C08B26416FF //2^( 94 /128-1) - 2^(- 94 /128-1), 2^(- 94 /128-1) + .quad 0x3FE132B341AD8761, 0x3FD32170FC4CD831 //2^( 95 /128-1) - 2^(- 95 /128-1), 2^(- 95 /128-1) + 
.quad 0x3FE165209441F823, 0x3FD306FE0A31B715 //2^( 96 /128-1) - 2^(- 96 /128-1), 2^(- 96 /128-1) + .quad 0x3FE197AF54EE9EBB, 0x3FD2ECAFA93E2F56 //2^( 97 /128-1) - 2^(- 97 /128-1), 2^(- 97 /128-1) + .quad 0x3FE1CA5FE4DD1475, 0x3FD2D285A6E4030B //2^( 98 /128-1) - 2^(- 98 /128-1), 2^(- 98 /128-1) + .quad 0x3FE1FD32A577EC72, 0x3FD2B87FD0DAD990 //2^( 99 /128-1) - 2^(- 99 /128-1), 2^(- 99 /128-1) + .quad 0x3FE23027F86B6ED6, 0x3FD29E9DF51FDEE1 //2^( 100 /128-1) - 2^(- 100 /128-1), 2^(- 100 /128-1) + .quad 0x3FE263403FA65489, 0x3FD284DFE1F56381 //2^( 101 /128-1) - 2^(- 101 /128-1), 2^(- 101 /128-1) + .quad 0x3FE2967BDD5A8364, 0x3FD26B4565E27CDD //2^( 102 /128-1) - 2^(- 102 /128-1), 2^(- 102 /128-1) + .quad 0x3FE2C9DB33FDCAE9, 0x3FD251CE4FB2A63F //2^( 103 /128-1) - 2^(- 103 /128-1), 2^(- 103 /128-1) + .quad 0x3FE2FD5EA64AA180, 0x3FD2387A6E756238 //2^( 104 /128-1) - 2^(- 104 /128-1), 2^(- 104 /128-1) + .quad 0x3FE331069740E22F, 0x3FD21F49917DDC96 //2^( 105 /128-1) - 2^(- 105 /128-1), 2^(- 105 /128-1) + .quad 0x3FE364D36A268AE0, 0x3FD2063B88628CD6 //2^( 106 /128-1) - 2^(- 106 /128-1), 2^(- 106 /128-1) + .quad 0x3FE398C582887B27, 0x3FD1ED5022FCD91D //2^( 107 /128-1) - 2^(- 107 /128-1), 2^(- 107 /128-1) + .quad 0x3FE3CCDD443B3394, 0x3FD1D4873168B9AA //2^( 108 /128-1) - 2^(- 108 /128-1), 2^(- 108 /128-1) + .quad 0x3FE4011B135B9590, 0x3FD1BBE084045CD4 //2^( 109 /128-1) - 2^(- 109 /128-1), 2^(- 109 /128-1) + .quad 0x3FE4357F544FA3C1, 0x3FD1A35BEB6FCB75 //2^( 110 /128-1) - 2^(- 110 /128-1), 2^(- 110 /128-1) + .quad 0x3FE46A0A6BC742FD, 0x3FD18AF9388C8DEA //2^( 111 /128-1) - 2^(- 111 /128-1), 2^(- 111 /128-1) + .quad 0x3FE49EBCBEBCFBCA, 0x3FD172B83C7D517B //2^( 112 /128-1) - 2^(- 112 /128-1), 2^(- 112 /128-1) + .quad 0x3FE4D396B276BC6F, 0x3FD15A98C8A58E51 //2^( 113 /128-1) - 2^(- 113 /128-1), 2^(- 113 /128-1) + .quad 0x3FE50898AC869B96, 0x3FD1429AAEA92DE0 //2^( 114 /128-1) - 2^(- 114 /128-1), 2^(- 114 /128-1) + .quad 0x3FE53DC312CB9B7A, 0x3FD12ABDC06C31CC //2^( 115 /128-1) - 2^(- 115 
/128-1), 2^(- 115 /128-1) + .quad 0x3FE573164B726DB6, 0x3FD11301D0125B51 //2^( 116 /128-1) - 2^(- 116 /128-1), 2^(- 116 /128-1) + .quad 0x3FE5A892BCF6379B, 0x3FD0FB66AFFED31B //2^( 117 /128-1) - 2^(- 117 /128-1), 2^(- 117 /128-1) + .quad 0x3FE5DE38CE215725, 0x3FD0E3EC32D3D1A2 //2^( 118 /128-1) - 2^(- 118 /128-1), 2^(- 118 /128-1) + .quad 0x3FE61408E60E2888, 0x3FD0CC922B7247F7 //2^( 119 /128-1) - 2^(- 119 /128-1), 2^(- 119 /128-1) + .quad 0x3FE64A036C27CC52, 0x3FD0B5586CF9890F //2^( 120 /128-1) - 2^(- 120 /128-1), 2^(- 120 /128-1) + .quad 0x3FE68028C82AEE2F, 0x3FD09E3ECAC6F383 //2^( 121 /128-1) - 2^(- 121 /128-1), 2^(- 121 /128-1) + .quad 0x3FE6B67962268C43, 0x3FD0874518759BC8 //2^( 122 /128-1) - 2^(- 122 /128-1), 2^(- 122 /128-1) + .quad 0x3FE6ECF5A27CBF28, 0x3FD0706B29DDF6DE //2^( 123 /128-1) - 2^(- 123 /128-1), 2^(- 123 /128-1) + .quad 0x3FE7239DF1E38286, 0x3FD059B0D3158574 //2^( 124 /128-1) - 2^(- 124 /128-1), 2^(- 124 /128-1) + .quad 0x3FE75A72B9657E51, 0x3FD04315E86E7F85 //2^( 125 /128-1) - 2^(- 125 /128-1), 2^(- 125 /128-1) + .quad 0x3FE791746262D0A8, 0x3FD02C9A3E778061 //2^( 126 /128-1) - 2^(- 126 /128-1), 2^(- 126 /128-1) + .quad 0x3FE7C8A35691D856, 0x3FD0163DA9FB3335 //2^( 127 /128-1) - 2^(- 127 /128-1), 2^(- 127 /128-1) + .align 16 + .quad 0x42C8000000000000, 0x42C8000000000000 /* _dbShifter = 1.5 * 2^(52-k)*/ + .align 16 + .long 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99 /* _iDomainRange 0x40861d9ac12a3e85 =(1021*2^K-0.5)*log(2)/2^K -needed for quick exp*/ + .align 16 + .quad 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD /* _dPC2 */ + .align 16 + .quad 0x3FC55555555554AD, 0x3FC55555555554AD /* _dPC3 */ + .align 16 + .quad 0x3FA55555CF16D299, 0x3FA55555CF16D299 /* _dPC4 */ + .align 16 + .quad 0x3F8111115712F425, 0x3F8111115712F425 /* _dPC5 */ + .align 16 + .quad 0x000000000000007f, 0x000000000000007f /* _lIndexMask */ + .align 16 + .type __svml_dsinh_data_internal,@object + .size __svml_dsinh_data_internal,.-__svml_dsinh_data_internal diff --git 
a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core-sse.S new file mode 100644 index 0000000000..ae531575fe --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized sinh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVdN4v_sinh _ZGVdN4v_sinh_sse_wrapper +#include "../svml_d_sinh4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core.c new file mode 100644 index 0000000000..bdf10b664b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized sinh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVdN4v_sinh +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_sinh, __GI__ZGVdN4v_sinh, __redirect__ZGVdN4v_sinh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core_avx2.S new file mode 100644 index 0000000000..27b50d31a8 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh4_core_avx2.S @@ -0,0 +1,470 @@ +/* Function sinh vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute sinh(x) as (exp(x)-exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * sinh(NaN) = quiet NaN, and raise invalid exception + * sinh(INF) = that INF + * sinh(x) = x for subnormals + * sinh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_dsinh_data_internal + */ +#define _dbInvLn2 0 +#define _dbLn2hi 32 +#define _dbLn2lo 64 +#define _dSign 96 +#define _dbT 128 +#define _dbShifter 2176 +#define _iDomainRange 2208 +#define _dPC2 2240 +#define _dPC3 2272 +#define _dPC4 2304 +#define _dPC5 2336 +#define _lIndexMask 2368 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_sinh_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea _dbT+8+__svml_dsinh_data_internal(%rip), %r8 + vmovupd _dbShifter+__svml_dsinh_data_internal(%rip), %ymm12 + +/* + * Load argument + * dM = x*2^K/log(2) + RShifter + */ + vmovupd _dbInvLn2+__svml_dsinh_data_internal(%rip), %ymm5 + vmovupd _dbLn2hi+__svml_dsinh_data_internal(%rip), %ymm13 + vmovapd %ymm0, %ymm8 + +/* + * VLOAD_CONST( D, dPC[0], TAB._dPC1 ); + * Abs argument + */ + vandpd _dSign+__svml_dsinh_data_internal(%rip), %ymm8, %ymm7 + vxorpd %ymm8, %ymm7, %ymm6 + vfmadd213pd %ymm12, %ymm6, %ymm5 + +/* + * R + * dN = dM - RShifter + */ + vsubpd %ymm12, %ymm5, %ymm3 + +/* + * Index and lookup + * j + */ + vandps _lIndexMask+__svml_dsinh_data_internal(%rip), %ymm5, %ymm4 + +/* + * Check for overflow/underflow + * + */ + vextractf128 $1, %ymm6, %xmm9 + vshufps $221, %xmm9, %xmm6, %xmm10 + +/* dR = dX - dN*Log2_hi/2^K */ + vfnmadd231pd %ymm13, %ymm3, %ymm6 + vpcmpgtd _iDomainRange+__svml_dsinh_data_internal(%rip), %xmm10, %xmm11 + vmovmskps %xmm11, %eax + +/* dR = (dX - dN*Log2_hi/2^K) - dN*Log2_lo/2^K */ + vfnmadd231pd
_dbLn2lo+__svml_dsinh_data_internal(%rip), %ymm3, %ymm6 + vextractf128 $1, %ymm4, %xmm0 + vmovd %xmm4, %edx + vmovd %xmm0, %esi + shll $4, %edx + vpextrd $2, %xmm4, %ecx + +/* split j and N */ + vxorps %ymm4, %ymm5, %ymm3 + shll $4, %esi + vpextrd $2, %xmm0, %edi + shll $4, %ecx + +/* + * G1,G2,G3: dTdif,dTn * 2^N,2^(-N) + * lM now is an EXP(2^N) + */ + vpsllq $45, %ymm3, %ymm4 + vmovq (%rdx,%r8), %xmm14 + vmovq (%rsi,%r8), %xmm1 + vmovhpd (%rcx,%r8), %xmm14, %xmm15 + shll $4, %edi + vmovhpd (%rdi,%r8), %xmm1, %xmm2 + +/* dR2 = dR^2 */ + vmulpd %ymm6, %ymm6, %ymm1 + vmovq -8(%rdx,%r8), %xmm9 + vmovq -8(%rsi,%r8), %xmm11 + vmovhpd -8(%rcx,%r8), %xmm9, %xmm10 + vmovhpd -8(%rdi,%r8), %xmm11, %xmm12 + vinsertf128 $1, %xmm2, %ymm15, %ymm2 + +/* */ + vpaddq %ymm4, %ymm2, %ymm5 + +/* */ + vpsubq %ymm4, %ymm2, %ymm14 + +/* dG3 = dTn*2^N + dTn*2^-N */ + vaddpd %ymm14, %ymm5, %ymm2 + +/* dG2 = dTn*2^N - dTn*2^-N */ + vsubpd %ymm14, %ymm5, %ymm14 + +/* + * sinh(r) = r*((a1=1)+r^2*(a3+r^2*a5)) = r + r*(r^2*(a3+r^2*a5)) .... 
+ * dSinh_r = (a3+r^2*a5) + */ + vmovupd _dPC5+__svml_dsinh_data_internal(%rip), %ymm5 + vfmadd213pd _dPC3+__svml_dsinh_data_internal(%rip), %ymm1, %ymm5 + vinsertf128 $1, %xmm12, %ymm10, %ymm13 + vpaddq %ymm4, %ymm13, %ymm0 + +/* dSinh_r = r^2*(a3+r^2*a5) */ + vmulpd %ymm5, %ymm1, %ymm4 + +/* dG2 += dG1 */ + vaddpd %ymm14, %ymm0, %ymm3 + +/* dG1 += dG3 */ + vaddpd %ymm2, %ymm0, %ymm0 + +/* dSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + vfmadd213pd %ymm6, %ymm6, %ymm4 + +/* + * poly(r) = (dG2+dG1)+dG3*sinh(dR)+dG1*sinh(dR)+(dG1+dG2)*dR2*(a2 +a4*dR2) + * dOut = (a2 +a4*dR2) + */ + vmovupd _dPC4+__svml_dsinh_data_internal(%rip), %ymm6 + vfmadd213pd _dPC2+__svml_dsinh_data_internal(%rip), %ymm1, %ymm6 + +/* dOut = dR2*(a2 +a4*dR2) */ + vmulpd %ymm6, %ymm1, %ymm1 + +/* dOut = dG2*dR2*(a2 +a4*dR2) */ + vmulpd %ymm3, %ymm1, %ymm6 + +/* dOut = dG1*sinh(dR)+dG2*dR2*(a2 +a4*dR2) */ + vfmadd213pd %ymm6, %ymm0, %ymm4 + +/* dOut = dG2 + dG1*sinh(dR)+dG2*dR2*(a2 +a4*dR2) */ + vaddpd %ymm4, %ymm3, %ymm5 + +/* Ret H */ + vorpd %ymm5, %ymm7, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm8 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm8, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; 
DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call sinh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in
loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_sinh_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dsinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _dbInvLn2[4][2]; + __declspec(align(32)) VUINT32 _dbLn2hi[4][2]; + __declspec(align(32)) VUINT32 _dbLn2lo[4][2]; + __declspec(align(32)) VUINT32 _dSign[4][2]; //0x8000000000000000 + __declspec(align(32)) VUINT32 _dbT[(1<<7)][2][2]; //precalc poly coeff + __declspec(align(32)) VUINT32 _dbShifter[4][2]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; + __declspec(align(32)) VUINT32 _dPC2[4][2]; + __declspec(align(32)) VUINT32 _dPC3[4][2]; + __declspec(align(32)) VUINT32 _dPC4[4][2]; + __declspec(align(32)) VUINT32 _dPC5[4][2]; + __declspec(align(32)) VUINT32 _lIndexMask[4][2]; +} __svml_dsinh_data_internal; +#endif +__svml_dsinh_data_internal: + .quad 0x3FF71547652B82FE, 0x3FF71547652B82FE, 0x3FF71547652B82FE, 0x3FF71547652B82FE /* _dbInvLn2 = 1/log(2) */ + .align 32 + .quad 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000 /* _dbLn2hi = log(2) hi*/ + .align 32 + .quad 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A /* _dbLn2lo = log(2) lo*/ + .align 32 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 /* _dSign */ + //_dbT + .align 32 + .quad 0x0000000000000000, 0x3FE0000000000000 //2^( 0 /128-1) - 2^(- 0 /128-1), 2^(- 0 /128-1) + .quad 0x3F762E4A19BD1E74, 0x3FDFD3C22B8F71F1 //2^( 1 /128-1) - 2^(- 1 /128-1), 2^(- 1 /128-1) + .quad 0x3F862E5F6A0DFD36, 0x3FDFA7C1819E90D8 //2^( 2 /128-1) - 2^(- 2 /128-1), 2^(- 2 /128-1) + .quad 0x3F90A2E234040F5F, 0x3FDF7BFDAD9CBE14 //2^( 3 /128-1) - 2^(- 3 /128-1), 2^(- 3 /128-1) + .quad 0x3F962EB4ABCC5A81, 0x3FDF50765B6E4540 //2^( 4 /128-1) - 2^(- 4 /128-1), 2^(- 4 /128-1) + .quad 0x3F9BBAB1C5033244, 0x3FDF252B376BBA97 //2^( 5 /128-1) - 2^(- 5 /128-1), 2^(- 5 /128-1) + .quad 
0x3FA0A372144EEB45, 0x3FDEFA1BEE615A27 //2^( 6 /128-1) - 2^(- 6 /128-1), 2^(- 6 /128-1) + .quad 0x3FA369AB3FFBF8B0, 0x3FDECF482D8E67F1 //2^( 7 /128-1) - 2^(- 7 /128-1), 2^(- 7 /128-1) + .quad 0x3FA63009BA740A2A, 0x3FDEA4AFA2A490DA //2^( 8 /128-1) - 2^(- 8 /128-1), 2^(- 8 /128-1) + .quad 0x3FA8F692D8EA1B5A, 0x3FDE7A51FBC74C83 //2^( 9 /128-1) - 2^(- 9 /128-1), 2^(- 9 /128-1) + .quad 0x3FABBD4BF0E31A6F, 0x3FDE502EE78B3FF6 //2^( 10 /128-1) - 2^(- 10 /128-1), 2^(- 10 /128-1) + .quad 0x3FAE843A5840286A, 0x3FDE264614F5A129 //2^( 11 /128-1) - 2^(- 11 /128-1), 2^(- 11 /128-1) + .quad 0x3FB0A5B1B2A46D0A, 0x3FDDFC97337B9B5F //2^( 12 /128-1) - 2^(- 12 /128-1), 2^(- 12 /128-1) + .quad 0x3FB20966375ABCDF, 0x3FDDD321F301B460 //2^( 13 /128-1) - 2^(- 13 /128-1), 2^(- 13 /128-1) + .quad 0x3FB36D3D65DCA4E8, 0x3FDDA9E603DB3285 //2^( 14 /128-1) - 2^(- 14 /128-1), 2^(- 14 /128-1) + .quad 0x3FB4D139EA06642A, 0x3FDD80E316C98398 //2^( 15 /128-1) - 2^(- 15 /128-1), 2^(- 15 /128-1) + .quad 0x3FB6355E6FFBF9BA, 0x3FDD5818DCFBA487 //2^( 16 /128-1) - 2^(- 16 /128-1), 2^(- 16 /128-1) + .quad 0x3FB799ADA42E4788, 0x3FDD2F87080D89F2 //2^( 17 /128-1) - 2^(- 17 /128-1), 2^(- 17 /128-1) + .quad 0x3FB8FE2A336035BC, 0x3FDD072D4A07897C //2^( 18 /128-1) - 2^(- 18 /128-1), 2^(- 18 /128-1) + .quad 0x3FBA62D6CAABD6B6, 0x3FDCDF0B555DC3FA //2^( 19 /128-1) - 2^(- 19 /128-1), 2^(- 19 /128-1) + .quad 0x3FBBC7B617878BAF, 0x3FDCB720DCEF9069 //2^( 20 /128-1) - 2^(- 20 /128-1), 2^(- 20 /128-1) + .quad 0x3FBD2CCAC7CB2A11, 0x3FDC8F6D9406E7B5 //2^( 21 /128-1) - 2^(- 21 /128-1), 2^(- 21 /128-1) + .quad 0x3FBE921789B52185, 0x3FDC67F12E57D14B //2^( 22 /128-1) - 2^(- 22 /128-1), 2^(- 22 /128-1) + .quad 0x3FBFF79F0BEFA2C7, 0x3FDC40AB5FFFD07A //2^( 23 /128-1) - 2^(- 23 /128-1), 2^(- 23 /128-1) + .quad 0x3FC0AEB1FECAE3A9, 0x3FDC199BDD85529C //2^( 24 /128-1) - 2^(- 24 /128-1), 2^(- 24 /128-1) + .quad 0x3FC161B4871C5CEC, 0x3FDBF2C25BD71E09 //2^( 25 /128-1) - 2^(- 25 /128-1), 2^(- 25 /128-1) + .quad 0x3FC214D876F26FD0, 
0x3FDBCC1E904BC1D2 //2^( 26 /128-1) - 2^(- 26 /128-1), 2^(- 26 /128-1) + .quad 0x3FC2C81F2693816F, 0x3FDBA5B030A1064A //2^( 27 /128-1) - 2^(- 27 /128-1), 2^(- 27 /128-1) + .quad 0x3FC37B89EE88BEF7, 0x3FDB7F76F2FB5E47 //2^( 28 /128-1) - 2^(- 28 /128-1), 2^(- 28 /128-1) + .quad 0x3FC42F1A27A0B3CD, 0x3FDB59728DE5593A //2^( 29 /128-1) - 2^(- 29 /128-1), 2^(- 29 /128-1) + .quad 0x3FC4E2D12AF1E037, 0x3FDB33A2B84F15FB //2^( 30 /128-1) - 2^(- 30 /128-1), 2^(- 30 /128-1) + .quad 0x3FC596B051DD508D, 0x3FDB0E07298DB666 //2^( 31 /128-1) - 2^(- 31 /128-1), 2^(- 31 /128-1) + .quad 0x3FC64AB8F61134FA, 0x3FDAE89F995AD3AD //2^( 32 /128-1) - 2^(- 32 /128-1), 2^(- 32 /128-1) + .quad 0x3FC6FEEC718B79D1, 0x3FDAC36BBFD3F37A //2^( 33 /128-1) - 2^(- 33 /128-1), 2^(- 33 /128-1) + .quad 0x3FC7B34C1E9C607F, 0x3FDA9E6B5579FDBF //2^( 34 /128-1) - 2^(- 34 /128-1), 2^(- 34 /128-1) + .quad 0x3FC867D957E91912, 0x3FDA799E1330B358 //2^( 35 /128-1) - 2^(- 35 /128-1), 2^(- 35 /128-1) + .quad 0x3FC91C95786E5C72, 0x3FDA5503B23E255D //2^( 36 /128-1) - 2^(- 36 /128-1), 2^(- 36 /128-1) + .quad 0x3FC9D181DB83072F, 0x3FDA309BEC4A2D33 //2^( 37 /128-1) - 2^(- 37 /128-1), 2^(- 37 /128-1) + .quad 0x3FCA869FDCDAB512, 0x3FDA0C667B5DE565 //2^( 38 /128-1) - 2^(- 38 /128-1), 2^(- 38 /128-1) + .quad 0x3FCB3BF0D8885D4C, 0x3FD9E86319E32323 //2^( 39 /128-1) - 2^(- 39 /128-1), 2^(- 39 /128-1) + .quad 0x3FCBF1762B00EF69, 0x3FD9C49182A3F090 //2^( 40 /128-1) - 2^(- 40 /128-1), 2^(- 40 /128-1) + .quad 0x3FCCA731311DF0FB, 0x3FD9A0F170CA07BA //2^( 41 /128-1) - 2^(- 41 /128-1), 2^(- 41 /128-1) + .quad 0x3FCD5D2348201C09, 0x3FD97D829FDE4E50 //2^( 42 /128-1) - 2^(- 42 /128-1), 2^(- 42 /128-1) + .quad 0x3FCE134DCDB1FE3E, 0x3FD95A44CBC8520F //2^( 43 /128-1) - 2^(- 43 /128-1), 2^(- 43 /128-1) + .quad 0x3FCEC9B21FEA98EA, 0x3FD93737B0CDC5E5 //2^( 44 /128-1) - 2^(- 44 /128-1), 2^(- 44 /128-1) + .quad 0x3FCF80519D5001D3, 0x3FD9145B0B91FFC6 //2^( 45 /128-1) - 2^(- 45 /128-1), 2^(- 45 /128-1) + .quad 0x3FD01B96D26D026A, 0x3FD8F1AE99157736 
//2^( 46 /128-1) - 2^(- 46 /128-1), 2^(- 46 /128-1) + .quad 0x3FD07723CAFA6331, 0x3FD8CF3216B5448C //2^( 47 /128-1) - 2^(- 47 /128-1), 2^(- 47 /128-1) + .quad 0x3FD0D2D06841B373, 0x3FD8ACE5422AA0DB //2^( 48 /128-1) - 2^(- 48 /128-1), 2^(- 48 /128-1) + .quad 0x3FD12E9D5A715381, 0x3FD88AC7D98A6699 //2^( 49 /128-1) - 2^(- 49 /128-1), 2^(- 49 /128-1) + .quad 0x3FD18A8B51F5C661, 0x3FD868D99B4492ED //2^( 50 /128-1) - 2^(- 50 /128-1), 2^(- 50 /128-1) + .quad 0x3FD1E69AFF7B04D7, 0x3FD8471A4623C7AD //2^( 51 /128-1) - 2^(- 51 /128-1), 2^(- 51 /128-1) + .quad 0x3FD242CD13EDD0F1, 0x3FD82589994CCE13 //2^( 52 /128-1) - 2^(- 52 /128-1), 2^(- 52 /128-1) + .quad 0x3FD29F22407D0A0C, 0x3FD80427543E1A12 //2^( 53 /128-1) - 2^(- 53 /128-1), 2^(- 53 /128-1) + .quad 0x3FD2FB9B369B0153, 0x3FD7E2F336CF4E62 //2^( 54 /128-1) - 2^(- 54 /128-1), 2^(- 54 /128-1) + .quad 0x3FD35838A7FECEC8, 0x3FD7C1ED0130C132 //2^( 55 /128-1) - 2^(- 55 /128-1), 2^(- 55 /128-1) + .quad 0x3FD3B4FB46A5A6CC, 0x3FD7A11473EB0187 //2^( 56 /128-1) - 2^(- 56 /128-1), 2^(- 56 /128-1) + .quad 0x3FD411E3C4D4302F, 0x3FD780694FDE5D3F //2^( 57 /128-1) - 2^(- 57 /128-1), 2^(- 57 /128-1) + .quad 0x3FD46EF2D517DAC8, 0x3FD75FEB564267C9 //2^( 58 /128-1) - 2^(- 58 /128-1), 2^(- 58 /128-1) + .quad 0x3FD4CC292A48369E, 0x3FD73F9A48A58174 //2^( 59 /128-1) - 2^(- 59 /128-1), 2^(- 59 /128-1) + .quad 0x3FD5298777884B96, 0x3FD71F75E8EC5F74 //2^( 60 /128-1) - 2^(- 60 /128-1), 2^(- 60 /128-1) + .quad 0x3FD5870E7047F1BC, 0x3FD6FF7DF9519484 //2^( 61 /128-1) - 2^(- 61 /128-1), 2^(- 61 /128-1) + .quad 0x3FD5E4BEC8452A1A, 0x3FD6DFB23C651A2F //2^( 62 /128-1) - 2^(- 62 /128-1), 2^(- 62 /128-1) + .quad 0x3FD64299338D7827, 0x3FD6C012750BDABF //2^( 63 /128-1) - 2^(- 63 /128-1), 2^(- 63 /128-1) + .quad 0x3FD6A09E667F3BCD, 0x3FD6A09E667F3BCD //2^( 64 /128-1) - 2^(- 64 /128-1), 2^(- 64 /128-1) + .quad 0x3FD6FECF15CB0C0B, 0x3FD68155D44CA973 //2^( 65 /128-1) - 2^(- 65 /128-1), 2^(- 65 /128-1) + .quad 0x3FD75D2BF6751239, 0x3FD6623882552225 //2^( 66 /128-1) - 
2^(- 66 /128-1), 2^(- 66 /128-1) + .quad 0x3FD7BBB5BDD665E8, 0x3FD6434634CCC320 //2^( 67 /128-1) - 2^(- 67 /128-1), 2^(- 67 /128-1) + .quad 0x3FD81A6D219E6963, 0x3FD6247EB03A5585 //2^( 68 /128-1) - 2^(- 68 /128-1), 2^(- 68 /128-1) + .quad 0x3FD87952D7D426DF, 0x3FD605E1B976DC09 //2^( 69 /128-1) - 2^(- 69 /128-1), 2^(- 69 /128-1) + .quad 0x3FD8D86796D7AE49, 0x3FD5E76F15AD2148 //2^( 70 /128-1) - 2^(- 70 /128-1), 2^(- 70 /128-1) + .quad 0x3FD937AC156373C8, 0x3FD5C9268A5946B7 //2^( 71 /128-1) - 2^(- 71 /128-1), 2^(- 71 /128-1) + .quad 0x3FD997210A8DAEE4, 0x3FD5AB07DD485429 //2^( 72 /128-1) - 2^(- 72 /128-1), 2^(- 72 /128-1) + .quad 0x3FD9F6C72DC9BA68, 0x3FD58D12D497C7FD //2^( 73 /128-1) - 2^(- 73 /128-1), 2^(- 73 /128-1) + .quad 0x3FDA569F36E974EA, 0x3FD56F4736B527DA //2^( 74 /128-1) - 2^(- 74 /128-1), 2^(- 74 /128-1) + .quad 0x3FDAB6A9DE1EA215, 0x3FD551A4CA5D920F //2^( 75 /128-1) - 2^(- 75 /128-1), 2^(- 75 /128-1) + .quad 0x3FDB16E7DBFC4CA3, 0x3FD5342B569D4F82 //2^( 76 /128-1) - 2^(- 76 /128-1), 2^(- 76 /128-1) + .quad 0x3FDB7759E9782918, 0x3FD516DAA2CF6642 //2^( 77 /128-1) - 2^(- 77 /128-1), 2^(- 77 /128-1) + .quad 0x3FDBD800BFEBF932, 0x3FD4F9B2769D2CA7 //2^( 78 /128-1) - 2^(- 78 /128-1), 2^(- 78 /128-1) + .quad 0x3FDC38DD1916F025, 0x3FD4DCB299FDDD0D //2^( 79 /128-1) - 2^(- 79 /128-1), 2^(- 79 /128-1) + .quad 0x3FDC99EFAF1F1790, 0x3FD4BFDAD5362A27 //2^( 80 /128-1) - 2^(- 80 /128-1), 2^(- 80 /128-1) + .quad 0x3FDCFB393C92B539, 0x3FD4A32AF0D7D3DE //2^( 81 /128-1) - 2^(- 81 /128-1), 2^(- 81 /128-1) + .quad 0x3FDD5CBA7C69B19C, 0x3FD486A2B5C13CD0 //2^( 82 /128-1) - 2^(- 82 /128-1), 2^(- 82 /128-1) + .quad 0x3FDDBE742A06FF34, 0x3FD46A41ED1D0057 //2^( 83 /128-1) - 2^(- 83 /128-1), 2^(- 83 /128-1) + .quad 0x3FDE2067013A029D, 0x3FD44E086061892D //2^( 84 /128-1) - 2^(- 84 /128-1), 2^(- 84 /128-1) + .quad 0x3FDE8293BE3FFB87, 0x3FD431F5D950A897 //2^( 85 /128-1) - 2^(- 85 /128-1), 2^(- 85 /128-1) + .quad 0x3FDEE4FB1DC56E75, 0x3FD4160A21F72E2A //2^( 86 /128-1) - 2^(- 86 /128-1), 
2^(- 86 /128-1) + .quad 0x3FDF479DDCE78F58, 0x3FD3FA4504AC801C //2^( 87 /128-1) - 2^(- 87 /128-1), 2^(- 87 /128-1) + .quad 0x3FDFAA7CB935ACFE, 0x3FD3DEA64C123422 //2^( 88 /128-1) - 2^(- 88 /128-1), 2^(- 88 /128-1) + .quad 0x3FE006CC38594EB1, 0x3FD3C32DC313A8E5 //2^( 89 /128-1) - 2^(- 89 /128-1), 2^(- 89 /128-1) + .quad 0x3FE03878E0EB1569, 0x3FD3A7DB34E59FF7 //2^( 90 /128-1) - 2^(- 90 /128-1), 2^(- 90 /128-1) + .quad 0x3FE06A44B5C74101, 0x3FD38CAE6D05D866 //2^( 91 /128-1) - 2^(- 91 /128-1), 2^(- 91 /128-1) + .quad 0x3FE09C3016A0D077, 0x3FD371A7373AA9CB //2^( 92 /128-1) - 2^(- 92 /128-1), 2^(- 92 /128-1) + .quad 0x3FE0CE3B63676360, 0x3FD356C55F929FF1 //2^( 93 /128-1) - 2^(- 93 /128-1), 2^(- 93 /128-1) + .quad 0x3FE10066FC47F240, 0x3FD33C08B26416FF //2^( 94 /128-1) - 2^(- 94 /128-1), 2^(- 94 /128-1) + .quad 0x3FE132B341AD8761, 0x3FD32170FC4CD831 //2^( 95 /128-1) - 2^(- 95 /128-1), 2^(- 95 /128-1) + .quad 0x3FE165209441F823, 0x3FD306FE0A31B715 //2^( 96 /128-1) - 2^(- 96 /128-1), 2^(- 96 /128-1) + .quad 0x3FE197AF54EE9EBB, 0x3FD2ECAFA93E2F56 //2^( 97 /128-1) - 2^(- 97 /128-1), 2^(- 97 /128-1) + .quad 0x3FE1CA5FE4DD1475, 0x3FD2D285A6E4030B //2^( 98 /128-1) - 2^(- 98 /128-1), 2^(- 98 /128-1) + .quad 0x3FE1FD32A577EC72, 0x3FD2B87FD0DAD990 //2^( 99 /128-1) - 2^(- 99 /128-1), 2^(- 99 /128-1) + .quad 0x3FE23027F86B6ED6, 0x3FD29E9DF51FDEE1 //2^( 100 /128-1) - 2^(- 100 /128-1), 2^(- 100 /128-1) + .quad 0x3FE263403FA65489, 0x3FD284DFE1F56381 //2^( 101 /128-1) - 2^(- 101 /128-1), 2^(- 101 /128-1) + .quad 0x3FE2967BDD5A8364, 0x3FD26B4565E27CDD //2^( 102 /128-1) - 2^(- 102 /128-1), 2^(- 102 /128-1) + .quad 0x3FE2C9DB33FDCAE9, 0x3FD251CE4FB2A63F //2^( 103 /128-1) - 2^(- 103 /128-1), 2^(- 103 /128-1) + .quad 0x3FE2FD5EA64AA180, 0x3FD2387A6E756238 //2^( 104 /128-1) - 2^(- 104 /128-1), 2^(- 104 /128-1) + .quad 0x3FE331069740E22F, 0x3FD21F49917DDC96 //2^( 105 /128-1) - 2^(- 105 /128-1), 2^(- 105 /128-1) + .quad 0x3FE364D36A268AE0, 0x3FD2063B88628CD6 //2^( 106 /128-1) - 2^(- 106 /128-1), 
2^(- 106 /128-1) + .quad 0x3FE398C582887B27, 0x3FD1ED5022FCD91D //2^( 107 /128-1) - 2^(- 107 /128-1), 2^(- 107 /128-1) + .quad 0x3FE3CCDD443B3394, 0x3FD1D4873168B9AA //2^( 108 /128-1) - 2^(- 108 /128-1), 2^(- 108 /128-1) + .quad 0x3FE4011B135B9590, 0x3FD1BBE084045CD4 //2^( 109 /128-1) - 2^(- 109 /128-1), 2^(- 109 /128-1) + .quad 0x3FE4357F544FA3C1, 0x3FD1A35BEB6FCB75 //2^( 110 /128-1) - 2^(- 110 /128-1), 2^(- 110 /128-1) + .quad 0x3FE46A0A6BC742FD, 0x3FD18AF9388C8DEA //2^( 111 /128-1) - 2^(- 111 /128-1), 2^(- 111 /128-1) + .quad 0x3FE49EBCBEBCFBCA, 0x3FD172B83C7D517B //2^( 112 /128-1) - 2^(- 112 /128-1), 2^(- 112 /128-1) + .quad 0x3FE4D396B276BC6F, 0x3FD15A98C8A58E51 //2^( 113 /128-1) - 2^(- 113 /128-1), 2^(- 113 /128-1) + .quad 0x3FE50898AC869B96, 0x3FD1429AAEA92DE0 //2^( 114 /128-1) - 2^(- 114 /128-1), 2^(- 114 /128-1) + .quad 0x3FE53DC312CB9B7A, 0x3FD12ABDC06C31CC //2^( 115 /128-1) - 2^(- 115 /128-1), 2^(- 115 /128-1) + .quad 0x3FE573164B726DB6, 0x3FD11301D0125B51 //2^( 116 /128-1) - 2^(- 116 /128-1), 2^(- 116 /128-1) + .quad 0x3FE5A892BCF6379B, 0x3FD0FB66AFFED31B //2^( 117 /128-1) - 2^(- 117 /128-1), 2^(- 117 /128-1) + .quad 0x3FE5DE38CE215725, 0x3FD0E3EC32D3D1A2 //2^( 118 /128-1) - 2^(- 118 /128-1), 2^(- 118 /128-1) + .quad 0x3FE61408E60E2888, 0x3FD0CC922B7247F7 //2^( 119 /128-1) - 2^(- 119 /128-1), 2^(- 119 /128-1) + .quad 0x3FE64A036C27CC52, 0x3FD0B5586CF9890F //2^( 120 /128-1) - 2^(- 120 /128-1), 2^(- 120 /128-1) + .quad 0x3FE68028C82AEE2F, 0x3FD09E3ECAC6F383 //2^( 121 /128-1) - 2^(- 121 /128-1), 2^(- 121 /128-1) + .quad 0x3FE6B67962268C43, 0x3FD0874518759BC8 //2^( 122 /128-1) - 2^(- 122 /128-1), 2^(- 122 /128-1) + .quad 0x3FE6ECF5A27CBF28, 0x3FD0706B29DDF6DE //2^( 123 /128-1) - 2^(- 123 /128-1), 2^(- 123 /128-1) + .quad 0x3FE7239DF1E38286, 0x3FD059B0D3158574 //2^( 124 /128-1) - 2^(- 124 /128-1), 2^(- 124 /128-1) + .quad 0x3FE75A72B9657E51, 0x3FD04315E86E7F85 //2^( 125 /128-1) - 2^(- 125 /128-1), 2^(- 125 /128-1) + .quad 0x3FE791746262D0A8, 
0x3FD02C9A3E778061 //2^( 126 /128-1) - 2^(- 126 /128-1), 2^(- 126 /128-1) + .quad 0x3FE7C8A35691D856, 0x3FD0163DA9FB3335 //2^( 127 /128-1) - 2^(- 127 /128-1), 2^(- 127 /128-1) + .align 32 + .quad 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000 /* _dbShifter = 1.5 * 2^(52-k)*/ + .align 32 + .long 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99 /* _iDomainRange 0x40861d9ac12a3e85 =(1021*2^K-0.5)*log(2)/2^K -needed for quick exp*/ + .align 32 + .quad 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD /* _dPC2 */ + .align 32 + .quad 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD /* _dPC3 */ + .align 32 + .quad 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299 /* _dPC4 */ + .align 32 + .quad 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425 /* _dPC5 */ + .align 32 + .quad 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f /* _lIndexMask */ + .align 32 + .type __svml_dsinh_data_internal,@object + .size __svml_dsinh_data_internal,.-__svml_dsinh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core-avx2.S new file mode 100644 index 0000000000..d767d25080 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized sinh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVeN8v_sinh _ZGVeN8v_sinh_avx2_wrapper +#include "../svml_d_sinh8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core.c new file mode 100644 index 0000000000..427d07bce2 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized sinh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVeN8v_sinh +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_sinh, __GI__ZGVeN8v_sinh, __redirect__ZGVeN8v_sinh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core_avx512.S new file mode 100644 index 0000000000..d057d6c7eb --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_sinh8_core_avx512.S @@ -0,0 +1,461 @@ +/* Function sinh vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute sinh(x) as (exp(x)-exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * sinh(NaN) = quiet NaN, and raise invalid exception + * sinh(INF) = that INF + * sinh(x) = x for subnormals + * sinh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_dsinh_data_internal + */ +#define _dbInvLn2 0 +#define _dbLn2hi 64 +#define _dbLn2lo 128 +#define _dSign 192 +#define _dbT 256 +#define _dbShifter 2304 +#define _iDomainRange 2368 +#define _dPC2 2432 +#define _dPC3 2496 +#define _dPC4 2560 +#define _dPC5 2624 +#define _lIndexMask 2688 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_sinh_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + lea _dbT+8+__svml_dsinh_data_internal(%rip), %rax + vmovaps %zmm0, %zmm8 + +/* Abs argument */ + vandpd _dSign+__svml_dsinh_data_internal(%rip), %zmm8, %zmm7 + vmovups _dbShifter+__svml_dsinh_data_internal(%rip), %zmm13 + +/* + * Load argument + * dM = x*2^K/log(2) + RShifter + */ + vmovups _dbInvLn2+__svml_dsinh_data_internal(%rip), %zmm12 + vmovups _dbLn2hi+__svml_dsinh_data_internal(%rip), %zmm14 + vmovups _dPC5+__svml_dsinh_data_internal(%rip), %zmm6 + +/* VLOAD_CONST( D, dPC[0], TAB._dPC1 ); */ + vmovups _dPC4+__svml_dsinh_data_internal(%rip), %zmm4 + vxorpd %zmm8, %zmm7, %zmm5 + kxnorw %k0, %k0, %k1 + kxnorw %k0, %k0, %k2 + vfmadd213pd {rn-sae}, %zmm13, %zmm5, %zmm12 + +/* + * Check for overflow/underflow + * + */ + vpsrlq $32, %zmm5, %zmm9 + +/* + * R + * dN = dM - RShifter + */ + vsubpd {rn-sae}, %zmm13, %zmm12, %zmm2 + vpmovqd %zmm9, %ymm10 + vmovups _dbLn2lo+__svml_dsinh_data_internal(%rip), %zmm9 + +/* dR = dX - dN*Log2_hi/2^K */ + vfnmadd231pd {rn-sae}, %zmm14, %zmm2, %zmm5 + +/* + * sinh(r) = r*((a1=1)+r^2*(a3+r^2*a5)) = r + r*(r^2*(a3+r^2*a5))
.... + * dSinh_r = (a3+r^2*a5) + */ + vmovups _dPC3+__svml_dsinh_data_internal(%rip), %zmm14 + +/* dR = (dX - dN*Log2_hi/2^K) - dN*Log2_lo/2^K */ + vfnmadd231pd {rn-sae}, %zmm9, %zmm2, %zmm5 + vpcmpgtd _iDomainRange+__svml_dsinh_data_internal(%rip), %ymm10, %ymm11 + vmovmskps %ymm11, %edx + +/* dR2 = dR^2 */ + vmulpd {rn-sae}, %zmm5, %zmm5, %zmm2 + vfmadd231pd {rn-sae}, %zmm2, %zmm6, %zmm14 + +/* + * Index and lookup + * j + */ + vpandq _lIndexMask+__svml_dsinh_data_internal(%rip), %zmm12, %zmm15 + vpsllq $4, %zmm15, %zmm1 + vpmovqd %zmm1, %ymm0 + vpxord %zmm11, %zmm11, %zmm11 + vpxord %zmm10, %zmm10, %zmm10 + vgatherdpd (%rax,%ymm0), %zmm11{%k1} + vgatherdpd -8(%rax,%ymm0), %zmm10{%k2} + +/* split j and N */ + vpxorq %zmm15, %zmm12, %zmm3 + +/* + * G1,G2,G3: dTdif,dTn * 2^N,2^(-N) + * lM now is an EXP(2^N) + */ + vpsllq $45, %zmm3, %zmm3 + vpaddq %zmm3, %zmm10, %zmm1 + +/* */ + vpaddq %zmm3, %zmm11, %zmm12 + +/* */ + vpsubq %zmm3, %zmm11, %zmm13 + +/* dSinh_r = r^2*(a3+r^2*a5) */ + vmulpd {rn-sae}, %zmm2, %zmm14, %zmm3 + +/* dG2 = dTn*2^N - dTn*2^-N */ + vsubpd {rn-sae}, %zmm13, %zmm12, %zmm15 + +/* dG3 = dTn*2^N + dTn*2^-N */ + vaddpd {rn-sae}, %zmm13, %zmm12, %zmm0 + +/* dSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + vfmadd213pd {rn-sae}, %zmm5, %zmm5, %zmm3 + +/* + * poly(r) = (dG2+dG1)+dG3*sinh(dR)+dG1*sinh(dR)+(dG1+dG2)*dR2*(a2 +a4*dR2) + * dOut = (a2 +a4*dR2) + */ + vmovups _dPC2+__svml_dsinh_data_internal(%rip), %zmm5 + +/* dG1 += dG3 */ + vaddpd {rn-sae}, %zmm0, %zmm1, %zmm6 + vfmadd231pd {rn-sae}, %zmm2, %zmm4, %zmm5 + +/* dOut = dR2*(a2 +a4*dR2) */ + vmulpd {rn-sae}, %zmm2, %zmm5, %zmm4 + +/* dG2 += dG1 */ + vaddpd {rn-sae}, %zmm15, %zmm1, %zmm2 + +/* dOut = dG2*dR2*(a2 +a4*dR2) */ + vmulpd {rn-sae}, %zmm2, %zmm4, %zmm4 + +/* dOut = dG1*sinh(dR)+dG2*dR2*(a2 +a4*dR2) */ + vfmadd213pd {rn-sae}, %zmm4, %zmm6, %zmm3 + +/* dOut = dG2 + dG1*sinh(dR)+dG2*dR2*(a2 +a4*dR2) */ + vaddpd {rn-sae}, %zmm2, %zmm3, %zmm0 + +/* Ret H */ + vorpd %zmm0, %zmm7, %zmm0 + testl %edx, 
%edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm8 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm8, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; 
DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call sinh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_sinh_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dsinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _dbInvLn2[8][2]; + __declspec(align(64)) VUINT32 _dbLn2hi[8][2]; + __declspec(align(64)) VUINT32 _dbLn2lo[8][2]; + __declspec(align(64)) VUINT32 _dSign[8][2]; //0x8000000000000000 + __declspec(align(64)) VUINT32 _dbT[(1<<7)][2][2]; //precalc poly coeff + __declspec(align(64)) VUINT32 _dbShifter[8][2]; + __declspec(align(64)) VUINT32 _iDomainRange[16][1]; + __declspec(align(64)) VUINT32 _dPC2[8][2]; + __declspec(align(64)) VUINT32 _dPC3[8][2]; + __declspec(align(64)) VUINT32 _dPC4[8][2]; + __declspec(align(64)) VUINT32 _dPC5[8][2]; + __declspec(align(64)) VUINT32 _lIndexMask[8][2]; +} __svml_dsinh_data_internal; +#endif +__svml_dsinh_data_internal: + .quad 0x3FF71547652B82FE, 0x3FF71547652B82FE, 0x3FF71547652B82FE, 0x3FF71547652B82FE, 0x3FF71547652B82FE, 0x3FF71547652B82FE,
0x3FF71547652B82FE, 0x3FF71547652B82FE /* _dbInvLn2 = 1/log(2) */ + .align 64 + .quad 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000, 0x3FE62E42FEFA0000 /* _dbLn2hi = log(2) hi*/ + .align 64 + .quad 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A, 0x3D7CF79ABC9E3B3A /* _dbLn2lo = log(2) lo*/ + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 /* _dSign */ + //_dbT + .align 64 + .quad 0x0000000000000000, 0x3FE0000000000000 //2^( 0 /128-1) - 2^(- 0 /128-1), 2^(- 0 /128-1) + .quad 0x3F762E4A19BD1E74, 0x3FDFD3C22B8F71F1 //2^( 1 /128-1) - 2^(- 1 /128-1), 2^(- 1 /128-1) + .quad 0x3F862E5F6A0DFD36, 0x3FDFA7C1819E90D8 //2^( 2 /128-1) - 2^(- 2 /128-1), 2^(- 2 /128-1) + .quad 0x3F90A2E234040F5F, 0x3FDF7BFDAD9CBE14 //2^( 3 /128-1) - 2^(- 3 /128-1), 2^(- 3 /128-1) + .quad 0x3F962EB4ABCC5A81, 0x3FDF50765B6E4540 //2^( 4 /128-1) - 2^(- 4 /128-1), 2^(- 4 /128-1) + .quad 0x3F9BBAB1C5033244, 0x3FDF252B376BBA97 //2^( 5 /128-1) - 2^(- 5 /128-1), 2^(- 5 /128-1) + .quad 0x3FA0A372144EEB45, 0x3FDEFA1BEE615A27 //2^( 6 /128-1) - 2^(- 6 /128-1), 2^(- 6 /128-1) + .quad 0x3FA369AB3FFBF8B0, 0x3FDECF482D8E67F1 //2^( 7 /128-1) - 2^(- 7 /128-1), 2^(- 7 /128-1) + .quad 0x3FA63009BA740A2A, 0x3FDEA4AFA2A490DA //2^( 8 /128-1) - 2^(- 8 /128-1), 2^(- 8 /128-1) + .quad 0x3FA8F692D8EA1B5A, 0x3FDE7A51FBC74C83 //2^( 9 /128-1) - 2^(- 9 /128-1), 2^(- 9 /128-1) + .quad 0x3FABBD4BF0E31A6F, 0x3FDE502EE78B3FF6 //2^( 10 /128-1) - 2^(- 10 /128-1), 2^(- 10 /128-1) + .quad 0x3FAE843A5840286A, 0x3FDE264614F5A129 //2^( 11 /128-1) - 2^(- 11 /128-1), 2^(- 11 /128-1) + .quad 0x3FB0A5B1B2A46D0A, 0x3FDDFC97337B9B5F //2^( 12 /128-1) - 2^(- 12 /128-1), 2^(- 12 /128-1) + .quad 0x3FB20966375ABCDF, 0x3FDDD321F301B460 //2^( 
13 /128-1) - 2^(- 13 /128-1), 2^(- 13 /128-1) + .quad 0x3FB36D3D65DCA4E8, 0x3FDDA9E603DB3285 //2^( 14 /128-1) - 2^(- 14 /128-1), 2^(- 14 /128-1) + .quad 0x3FB4D139EA06642A, 0x3FDD80E316C98398 //2^( 15 /128-1) - 2^(- 15 /128-1), 2^(- 15 /128-1) + .quad 0x3FB6355E6FFBF9BA, 0x3FDD5818DCFBA487 //2^( 16 /128-1) - 2^(- 16 /128-1), 2^(- 16 /128-1) + .quad 0x3FB799ADA42E4788, 0x3FDD2F87080D89F2 //2^( 17 /128-1) - 2^(- 17 /128-1), 2^(- 17 /128-1) + .quad 0x3FB8FE2A336035BC, 0x3FDD072D4A07897C //2^( 18 /128-1) - 2^(- 18 /128-1), 2^(- 18 /128-1) + .quad 0x3FBA62D6CAABD6B6, 0x3FDCDF0B555DC3FA //2^( 19 /128-1) - 2^(- 19 /128-1), 2^(- 19 /128-1) + .quad 0x3FBBC7B617878BAF, 0x3FDCB720DCEF9069 //2^( 20 /128-1) - 2^(- 20 /128-1), 2^(- 20 /128-1) + .quad 0x3FBD2CCAC7CB2A11, 0x3FDC8F6D9406E7B5 //2^( 21 /128-1) - 2^(- 21 /128-1), 2^(- 21 /128-1) + .quad 0x3FBE921789B52185, 0x3FDC67F12E57D14B //2^( 22 /128-1) - 2^(- 22 /128-1), 2^(- 22 /128-1) + .quad 0x3FBFF79F0BEFA2C7, 0x3FDC40AB5FFFD07A //2^( 23 /128-1) - 2^(- 23 /128-1), 2^(- 23 /128-1) + .quad 0x3FC0AEB1FECAE3A9, 0x3FDC199BDD85529C //2^( 24 /128-1) - 2^(- 24 /128-1), 2^(- 24 /128-1) + .quad 0x3FC161B4871C5CEC, 0x3FDBF2C25BD71E09 //2^( 25 /128-1) - 2^(- 25 /128-1), 2^(- 25 /128-1) + .quad 0x3FC214D876F26FD0, 0x3FDBCC1E904BC1D2 //2^( 26 /128-1) - 2^(- 26 /128-1), 2^(- 26 /128-1) + .quad 0x3FC2C81F2693816F, 0x3FDBA5B030A1064A //2^( 27 /128-1) - 2^(- 27 /128-1), 2^(- 27 /128-1) + .quad 0x3FC37B89EE88BEF7, 0x3FDB7F76F2FB5E47 //2^( 28 /128-1) - 2^(- 28 /128-1), 2^(- 28 /128-1) + .quad 0x3FC42F1A27A0B3CD, 0x3FDB59728DE5593A //2^( 29 /128-1) - 2^(- 29 /128-1), 2^(- 29 /128-1) + .quad 0x3FC4E2D12AF1E037, 0x3FDB33A2B84F15FB //2^( 30 /128-1) - 2^(- 30 /128-1), 2^(- 30 /128-1) + .quad 0x3FC596B051DD508D, 0x3FDB0E07298DB666 //2^( 31 /128-1) - 2^(- 31 /128-1), 2^(- 31 /128-1) + .quad 0x3FC64AB8F61134FA, 0x3FDAE89F995AD3AD //2^( 32 /128-1) - 2^(- 32 /128-1), 2^(- 32 /128-1) + .quad 0x3FC6FEEC718B79D1, 0x3FDAC36BBFD3F37A //2^( 33 /128-1) - 2^(- 
33 /128-1), 2^(- 33 /128-1) + .quad 0x3FC7B34C1E9C607F, 0x3FDA9E6B5579FDBF //2^( 34 /128-1) - 2^(- 34 /128-1), 2^(- 34 /128-1) + .quad 0x3FC867D957E91912, 0x3FDA799E1330B358 //2^( 35 /128-1) - 2^(- 35 /128-1), 2^(- 35 /128-1) + .quad 0x3FC91C95786E5C72, 0x3FDA5503B23E255D //2^( 36 /128-1) - 2^(- 36 /128-1), 2^(- 36 /128-1) + .quad 0x3FC9D181DB83072F, 0x3FDA309BEC4A2D33 //2^( 37 /128-1) - 2^(- 37 /128-1), 2^(- 37 /128-1) + .quad 0x3FCA869FDCDAB512, 0x3FDA0C667B5DE565 //2^( 38 /128-1) - 2^(- 38 /128-1), 2^(- 38 /128-1) + .quad 0x3FCB3BF0D8885D4C, 0x3FD9E86319E32323 //2^( 39 /128-1) - 2^(- 39 /128-1), 2^(- 39 /128-1) + .quad 0x3FCBF1762B00EF69, 0x3FD9C49182A3F090 //2^( 40 /128-1) - 2^(- 40 /128-1), 2^(- 40 /128-1) + .quad 0x3FCCA731311DF0FB, 0x3FD9A0F170CA07BA //2^( 41 /128-1) - 2^(- 41 /128-1), 2^(- 41 /128-1) + .quad 0x3FCD5D2348201C09, 0x3FD97D829FDE4E50 //2^( 42 /128-1) - 2^(- 42 /128-1), 2^(- 42 /128-1) + .quad 0x3FCE134DCDB1FE3E, 0x3FD95A44CBC8520F //2^( 43 /128-1) - 2^(- 43 /128-1), 2^(- 43 /128-1) + .quad 0x3FCEC9B21FEA98EA, 0x3FD93737B0CDC5E5 //2^( 44 /128-1) - 2^(- 44 /128-1), 2^(- 44 /128-1) + .quad 0x3FCF80519D5001D3, 0x3FD9145B0B91FFC6 //2^( 45 /128-1) - 2^(- 45 /128-1), 2^(- 45 /128-1) + .quad 0x3FD01B96D26D026A, 0x3FD8F1AE99157736 //2^( 46 /128-1) - 2^(- 46 /128-1), 2^(- 46 /128-1) + .quad 0x3FD07723CAFA6331, 0x3FD8CF3216B5448C //2^( 47 /128-1) - 2^(- 47 /128-1), 2^(- 47 /128-1) + .quad 0x3FD0D2D06841B373, 0x3FD8ACE5422AA0DB //2^( 48 /128-1) - 2^(- 48 /128-1), 2^(- 48 /128-1) + .quad 0x3FD12E9D5A715381, 0x3FD88AC7D98A6699 //2^( 49 /128-1) - 2^(- 49 /128-1), 2^(- 49 /128-1) + .quad 0x3FD18A8B51F5C661, 0x3FD868D99B4492ED //2^( 50 /128-1) - 2^(- 50 /128-1), 2^(- 50 /128-1) + .quad 0x3FD1E69AFF7B04D7, 0x3FD8471A4623C7AD //2^( 51 /128-1) - 2^(- 51 /128-1), 2^(- 51 /128-1) + .quad 0x3FD242CD13EDD0F1, 0x3FD82589994CCE13 //2^( 52 /128-1) - 2^(- 52 /128-1), 2^(- 52 /128-1) + .quad 0x3FD29F22407D0A0C, 0x3FD80427543E1A12 //2^( 53 /128-1) - 2^(- 53 /128-1), 2^(- 53 
/128-1) + .quad 0x3FD2FB9B369B0153, 0x3FD7E2F336CF4E62 //2^( 54 /128-1) - 2^(- 54 /128-1), 2^(- 54 /128-1) + .quad 0x3FD35838A7FECEC8, 0x3FD7C1ED0130C132 //2^( 55 /128-1) - 2^(- 55 /128-1), 2^(- 55 /128-1) + .quad 0x3FD3B4FB46A5A6CC, 0x3FD7A11473EB0187 //2^( 56 /128-1) - 2^(- 56 /128-1), 2^(- 56 /128-1) + .quad 0x3FD411E3C4D4302F, 0x3FD780694FDE5D3F //2^( 57 /128-1) - 2^(- 57 /128-1), 2^(- 57 /128-1) + .quad 0x3FD46EF2D517DAC8, 0x3FD75FEB564267C9 //2^( 58 /128-1) - 2^(- 58 /128-1), 2^(- 58 /128-1) + .quad 0x3FD4CC292A48369E, 0x3FD73F9A48A58174 //2^( 59 /128-1) - 2^(- 59 /128-1), 2^(- 59 /128-1) + .quad 0x3FD5298777884B96, 0x3FD71F75E8EC5F74 //2^( 60 /128-1) - 2^(- 60 /128-1), 2^(- 60 /128-1) + .quad 0x3FD5870E7047F1BC, 0x3FD6FF7DF9519484 //2^( 61 /128-1) - 2^(- 61 /128-1), 2^(- 61 /128-1) + .quad 0x3FD5E4BEC8452A1A, 0x3FD6DFB23C651A2F //2^( 62 /128-1) - 2^(- 62 /128-1), 2^(- 62 /128-1) + .quad 0x3FD64299338D7827, 0x3FD6C012750BDABF //2^( 63 /128-1) - 2^(- 63 /128-1), 2^(- 63 /128-1) + .quad 0x3FD6A09E667F3BCD, 0x3FD6A09E667F3BCD //2^( 64 /128-1) - 2^(- 64 /128-1), 2^(- 64 /128-1) + .quad 0x3FD6FECF15CB0C0B, 0x3FD68155D44CA973 //2^( 65 /128-1) - 2^(- 65 /128-1), 2^(- 65 /128-1) + .quad 0x3FD75D2BF6751239, 0x3FD6623882552225 //2^( 66 /128-1) - 2^(- 66 /128-1), 2^(- 66 /128-1) + .quad 0x3FD7BBB5BDD665E8, 0x3FD6434634CCC320 //2^( 67 /128-1) - 2^(- 67 /128-1), 2^(- 67 /128-1) + .quad 0x3FD81A6D219E6963, 0x3FD6247EB03A5585 //2^( 68 /128-1) - 2^(- 68 /128-1), 2^(- 68 /128-1) + .quad 0x3FD87952D7D426DF, 0x3FD605E1B976DC09 //2^( 69 /128-1) - 2^(- 69 /128-1), 2^(- 69 /128-1) + .quad 0x3FD8D86796D7AE49, 0x3FD5E76F15AD2148 //2^( 70 /128-1) - 2^(- 70 /128-1), 2^(- 70 /128-1) + .quad 0x3FD937AC156373C8, 0x3FD5C9268A5946B7 //2^( 71 /128-1) - 2^(- 71 /128-1), 2^(- 71 /128-1) + .quad 0x3FD997210A8DAEE4, 0x3FD5AB07DD485429 //2^( 72 /128-1) - 2^(- 72 /128-1), 2^(- 72 /128-1) + .quad 0x3FD9F6C72DC9BA68, 0x3FD58D12D497C7FD //2^( 73 /128-1) - 2^(- 73 /128-1), 2^(- 73 /128-1) + .quad 
0x3FDA569F36E974EA, 0x3FD56F4736B527DA //2^( 74 /128-1) - 2^(- 74 /128-1), 2^(- 74 /128-1) + .quad 0x3FDAB6A9DE1EA215, 0x3FD551A4CA5D920F //2^( 75 /128-1) - 2^(- 75 /128-1), 2^(- 75 /128-1) + .quad 0x3FDB16E7DBFC4CA3, 0x3FD5342B569D4F82 //2^( 76 /128-1) - 2^(- 76 /128-1), 2^(- 76 /128-1) + .quad 0x3FDB7759E9782918, 0x3FD516DAA2CF6642 //2^( 77 /128-1) - 2^(- 77 /128-1), 2^(- 77 /128-1) + .quad 0x3FDBD800BFEBF932, 0x3FD4F9B2769D2CA7 //2^( 78 /128-1) - 2^(- 78 /128-1), 2^(- 78 /128-1) + .quad 0x3FDC38DD1916F025, 0x3FD4DCB299FDDD0D //2^( 79 /128-1) - 2^(- 79 /128-1), 2^(- 79 /128-1) + .quad 0x3FDC99EFAF1F1790, 0x3FD4BFDAD5362A27 //2^( 80 /128-1) - 2^(- 80 /128-1), 2^(- 80 /128-1) + .quad 0x3FDCFB393C92B539, 0x3FD4A32AF0D7D3DE //2^( 81 /128-1) - 2^(- 81 /128-1), 2^(- 81 /128-1) + .quad 0x3FDD5CBA7C69B19C, 0x3FD486A2B5C13CD0 //2^( 82 /128-1) - 2^(- 82 /128-1), 2^(- 82 /128-1) + .quad 0x3FDDBE742A06FF34, 0x3FD46A41ED1D0057 //2^( 83 /128-1) - 2^(- 83 /128-1), 2^(- 83 /128-1) + .quad 0x3FDE2067013A029D, 0x3FD44E086061892D //2^( 84 /128-1) - 2^(- 84 /128-1), 2^(- 84 /128-1) + .quad 0x3FDE8293BE3FFB87, 0x3FD431F5D950A897 //2^( 85 /128-1) - 2^(- 85 /128-1), 2^(- 85 /128-1) + .quad 0x3FDEE4FB1DC56E75, 0x3FD4160A21F72E2A //2^( 86 /128-1) - 2^(- 86 /128-1), 2^(- 86 /128-1) + .quad 0x3FDF479DDCE78F58, 0x3FD3FA4504AC801C //2^( 87 /128-1) - 2^(- 87 /128-1), 2^(- 87 /128-1) + .quad 0x3FDFAA7CB935ACFE, 0x3FD3DEA64C123422 //2^( 88 /128-1) - 2^(- 88 /128-1), 2^(- 88 /128-1) + .quad 0x3FE006CC38594EB1, 0x3FD3C32DC313A8E5 //2^( 89 /128-1) - 2^(- 89 /128-1), 2^(- 89 /128-1) + .quad 0x3FE03878E0EB1569, 0x3FD3A7DB34E59FF7 //2^( 90 /128-1) - 2^(- 90 /128-1), 2^(- 90 /128-1) + .quad 0x3FE06A44B5C74101, 0x3FD38CAE6D05D866 //2^( 91 /128-1) - 2^(- 91 /128-1), 2^(- 91 /128-1) + .quad 0x3FE09C3016A0D077, 0x3FD371A7373AA9CB //2^( 92 /128-1) - 2^(- 92 /128-1), 2^(- 92 /128-1) + .quad 0x3FE0CE3B63676360, 0x3FD356C55F929FF1 //2^( 93 /128-1) - 2^(- 93 /128-1), 2^(- 93 /128-1) + .quad 0x3FE10066FC47F240, 
0x3FD33C08B26416FF //2^( 94 /128-1) - 2^(- 94 /128-1), 2^(- 94 /128-1) + .quad 0x3FE132B341AD8761, 0x3FD32170FC4CD831 //2^( 95 /128-1) - 2^(- 95 /128-1), 2^(- 95 /128-1) + .quad 0x3FE165209441F823, 0x3FD306FE0A31B715 //2^( 96 /128-1) - 2^(- 96 /128-1), 2^(- 96 /128-1) + .quad 0x3FE197AF54EE9EBB, 0x3FD2ECAFA93E2F56 //2^( 97 /128-1) - 2^(- 97 /128-1), 2^(- 97 /128-1) + .quad 0x3FE1CA5FE4DD1475, 0x3FD2D285A6E4030B //2^( 98 /128-1) - 2^(- 98 /128-1), 2^(- 98 /128-1) + .quad 0x3FE1FD32A577EC72, 0x3FD2B87FD0DAD990 //2^( 99 /128-1) - 2^(- 99 /128-1), 2^(- 99 /128-1) + .quad 0x3FE23027F86B6ED6, 0x3FD29E9DF51FDEE1 //2^( 100 /128-1) - 2^(- 100 /128-1), 2^(- 100 /128-1) + .quad 0x3FE263403FA65489, 0x3FD284DFE1F56381 //2^( 101 /128-1) - 2^(- 101 /128-1), 2^(- 101 /128-1) + .quad 0x3FE2967BDD5A8364, 0x3FD26B4565E27CDD //2^( 102 /128-1) - 2^(- 102 /128-1), 2^(- 102 /128-1) + .quad 0x3FE2C9DB33FDCAE9, 0x3FD251CE4FB2A63F //2^( 103 /128-1) - 2^(- 103 /128-1), 2^(- 103 /128-1) + .quad 0x3FE2FD5EA64AA180, 0x3FD2387A6E756238 //2^( 104 /128-1) - 2^(- 104 /128-1), 2^(- 104 /128-1) + .quad 0x3FE331069740E22F, 0x3FD21F49917DDC96 //2^( 105 /128-1) - 2^(- 105 /128-1), 2^(- 105 /128-1) + .quad 0x3FE364D36A268AE0, 0x3FD2063B88628CD6 //2^( 106 /128-1) - 2^(- 106 /128-1), 2^(- 106 /128-1) + .quad 0x3FE398C582887B27, 0x3FD1ED5022FCD91D //2^( 107 /128-1) - 2^(- 107 /128-1), 2^(- 107 /128-1) + .quad 0x3FE3CCDD443B3394, 0x3FD1D4873168B9AA //2^( 108 /128-1) - 2^(- 108 /128-1), 2^(- 108 /128-1) + .quad 0x3FE4011B135B9590, 0x3FD1BBE084045CD4 //2^( 109 /128-1) - 2^(- 109 /128-1), 2^(- 109 /128-1) + .quad 0x3FE4357F544FA3C1, 0x3FD1A35BEB6FCB75 //2^( 110 /128-1) - 2^(- 110 /128-1), 2^(- 110 /128-1) + .quad 0x3FE46A0A6BC742FD, 0x3FD18AF9388C8DEA //2^( 111 /128-1) - 2^(- 111 /128-1), 2^(- 111 /128-1) + .quad 0x3FE49EBCBEBCFBCA, 0x3FD172B83C7D517B //2^( 112 /128-1) - 2^(- 112 /128-1), 2^(- 112 /128-1) + .quad 0x3FE4D396B276BC6F, 0x3FD15A98C8A58E51 //2^( 113 /128-1) - 2^(- 113 /128-1), 2^(- 113 /128-1) + 
.quad 0x3FE50898AC869B96, 0x3FD1429AAEA92DE0 //2^( 114 /128-1) - 2^(- 114 /128-1), 2^(- 114 /128-1) + .quad 0x3FE53DC312CB9B7A, 0x3FD12ABDC06C31CC //2^( 115 /128-1) - 2^(- 115 /128-1), 2^(- 115 /128-1) + .quad 0x3FE573164B726DB6, 0x3FD11301D0125B51 //2^( 116 /128-1) - 2^(- 116 /128-1), 2^(- 116 /128-1) + .quad 0x3FE5A892BCF6379B, 0x3FD0FB66AFFED31B //2^( 117 /128-1) - 2^(- 117 /128-1), 2^(- 117 /128-1) + .quad 0x3FE5DE38CE215725, 0x3FD0E3EC32D3D1A2 //2^( 118 /128-1) - 2^(- 118 /128-1), 2^(- 118 /128-1) + .quad 0x3FE61408E60E2888, 0x3FD0CC922B7247F7 //2^( 119 /128-1) - 2^(- 119 /128-1), 2^(- 119 /128-1) + .quad 0x3FE64A036C27CC52, 0x3FD0B5586CF9890F //2^( 120 /128-1) - 2^(- 120 /128-1), 2^(- 120 /128-1) + .quad 0x3FE68028C82AEE2F, 0x3FD09E3ECAC6F383 //2^( 121 /128-1) - 2^(- 121 /128-1), 2^(- 121 /128-1) + .quad 0x3FE6B67962268C43, 0x3FD0874518759BC8 //2^( 122 /128-1) - 2^(- 122 /128-1), 2^(- 122 /128-1) + .quad 0x3FE6ECF5A27CBF28, 0x3FD0706B29DDF6DE //2^( 123 /128-1) - 2^(- 123 /128-1), 2^(- 123 /128-1) + .quad 0x3FE7239DF1E38286, 0x3FD059B0D3158574 //2^( 124 /128-1) - 2^(- 124 /128-1), 2^(- 124 /128-1) + .quad 0x3FE75A72B9657E51, 0x3FD04315E86E7F85 //2^( 125 /128-1) - 2^(- 125 /128-1), 2^(- 125 /128-1) + .quad 0x3FE791746262D0A8, 0x3FD02C9A3E778061 //2^( 126 /128-1) - 2^(- 126 /128-1), 2^(- 126 /128-1) + .quad 0x3FE7C8A35691D856, 0x3FD0163DA9FB3335 //2^( 127 /128-1) - 2^(- 127 /128-1), 2^(- 127 /128-1) + .align 64 + .quad 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000, 0x42C8000000000000 /* _dbShifter = 1.5 * 2^(52-k)*/ + .align 64 + .long 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99, 0x40861d99 /* _iDomainRange 0x40861d9ac12a3e85 =(1021*2^K-0.5)*log(2)/2^K -needed for quick exp*/ + .align 64 + .quad 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 
0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD, 0x3FDFFFFFFFFFFDBD /* _dPC2 */ + .align 64 + .quad 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD, 0x3FC55555555554AD /* _dPC3 */ + .align 64 + .quad 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299, 0x3FA55555CF16D299 /* _dPC4 */ + .align 64 + .quad 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425, 0x3F8111115712F425 /* _dPC5 */ + .align 64 + .quad 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f, 0x000000000000007f /* _lIndexMask */ + .align 64 + .type __svml_dsinh_data_internal,@object + .size __svml_dsinh_data_internal,.-__svml_dsinh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core-avx2.S new file mode 100644 index 0000000000..06525b7b37 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized sinhf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN16v_sinhf _ZGVeN16v_sinhf_avx2_wrapper +#include "../svml_s_sinhf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core.c new file mode 100644 index 0000000000..6a954caa37 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized sinhf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_sinhf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_sinhf, __GI__ZGVeN16v_sinhf, + __redirect__ZGVeN16v_sinhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core_avx512.S new file mode 100644 index 0000000000..1119c00259 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf16_core_avx512.S @@ -0,0 +1,318 @@ +/* Function sinhf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. 
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   https://www.gnu.org/licenses/.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *   Compute sinh(x) as (exp(x)-exp(-x))/2,
+ *   where exp is calculated as
+ *   exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r)
+ *
+ *   Special cases:
+ *
+ *   sinh(NaN) = quiet NaN, and raise invalid exception
+ *   sinh(INF) = that INF
+ *   sinh(x) = x for subnormals
+ *   sinh(x) overflows for big x and returns MAXLOG+log(2)
+ *
+ */
+
+/* Offsets for data table __svml_ssinh_data_internal
+ */
+#define _sInvLn2 0
+#define _sLn2hi 64
+#define _sLn2lo 128
+#define _sSign 192
+#define _sShifter 256
+#define _iDomainRange 320
+#define _sPC1 384
+#define _sPC2 448
+#define _sPC3 512
+#define _sPC4 576
+#define _sPC5 640
+#define _sPC6 704
+#define _iHalf 768
+
+#include <sysdep.h>
+
+	.text
+	.section .text.evex512,"ax",@progbits
+ENTRY(_ZGVeN16v_sinhf_skx)
+	pushq	%rbp
+	cfi_def_cfa_offset(16)
+	movq	%rsp, %rbp
+	cfi_def_cfa(6, 16)
+	cfi_offset(6, -16)
+	andq	$-64, %rsp
+	subq	$192, %rsp
+	vmovaps	%zmm0, %zmm5
+
+/*
+ * Implementation
+ * Abs argument
+ */
+	vandps	_sSign+__svml_ssinh_data_internal(%rip), %zmm5, %zmm4
+
+/*
+ * Check for overflow/underflow
+ * Is this faster than GE?
+ */
+	vpternlogd $255, %zmm6, %zmm6, %zmm6
+	vmovups	_sShifter+__svml_ssinh_data_internal(%rip), %zmm7
+
+/*
+ * Load argument
+ * dM = x/log(2) + RShifter
+ */
+	vmovups	_sInvLn2+__svml_ssinh_data_internal(%rip), %zmm11
+	vmovups	_sLn2hi+__svml_ssinh_data_internal(%rip), %zmm8
+	vmovups	_sLn2lo+__svml_ssinh_data_internal(%rip), %zmm10
+	vmovups	_iHalf+__svml_ssinh_data_internal(%rip), %zmm12
+	vmovups	_sPC5+__svml_ssinh_data_internal(%rip), %zmm0
+	vmovups	_sPC6+__svml_ssinh_data_internal(%rip), %zmm3
+
+/* x^2 */
+	vmovups	_sPC2+__svml_ssinh_data_internal(%rip), %zmm2
+	vxorps	%zmm5, %zmm4, %zmm1
+	vfmadd213ps {rn-sae}, %zmm7, %zmm1, %zmm11
+	vpcmpd	$2, _iDomainRange+__svml_ssinh_data_internal(%rip), %zmm1, %k1
+
+/*
+ * G1,G2 2^N,2^(-N)
+ * iM now is an EXP(2^N)
+ */
+	vpslld	$23, %zmm11, %zmm13
+
+/*
+ * R
+ * sN = sM - RShifter
+ */
+	vsubps	{rn-sae}, %zmm7, %zmm11, %zmm9
+	vpaddd	%zmm13, %zmm12, %zmm14
+	vpsubd	%zmm13, %zmm12, %zmm15
+
+/* sG1 = 2^(N-1)+2^(-N-1) */
+	vaddps	{rn-sae}, %zmm15, %zmm14, %zmm7
+	vpandnd	%zmm1, %zmm1, %zmm6{%k1}
+
+/* sR = sX - sN*Log2_hi */
+	vfnmadd231ps {rn-sae}, %zmm8, %zmm9, %zmm1
+	vptestmd %zmm6, %zmm6, %k0
+
+/* sG2 = 2^(N-1)-2^(-N-1) */
+	vsubps	{rn-sae}, %zmm15, %zmm14, %zmm8
+
+/* sR = (sX - sN*Log2_hi) - sN*Log2_lo */
+	vfnmadd231ps {rn-sae}, %zmm10, %zmm9, %zmm1
+
+/*
+ * sinh(r) = r*(a1 + r^2*(a3 + r^2*(a5 + r^2*a7))) = r + r*(r^2*(a3 + r^2*(a5 + r^2*a7))), a1 = 1
+ * sSinh_r = (a3+r^2*a5)
+ */
+	vmovups	_sPC3+__svml_ssinh_data_internal(%rip), %zmm14
+	kmovw	%k0, %edx
+
+/* sR2 = sR^2 */
+	vmulps	{rn-sae}, %zmm1, %zmm1, %zmm6
+	vfmadd231ps {rn-sae}, %zmm6, %zmm0, %zmm14
+
+/* sSinh_r = r^2*(a3+r^2*a5) */
+	vmulps	{rn-sae}, %zmm6, %zmm14, %zmm0
+
+/* sSinh_r = r + r*(r^2*(a3+r^2*a5)) */
+	vfmadd213ps {rn-sae}, %zmm1, %zmm1, %zmm0
+
+/*
+ * sinh(X) = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2))
+ * sOut = (a4 + a6*sR2)
+ */
+	vmovups	_sPC4+__svml_ssinh_data_internal(%rip), %zmm1
+	vfmadd231ps {rn-sae}, %zmm6, %zmm3, %zmm1
+
+/* sOut = a2+sR2*(a4+a6*sR2) */
+	vfmadd213ps {rn-sae}, %zmm2, %zmm6, %zmm1
+
+/* sOut = sR2*(a2+sR2*(a4+a6*sR2)) */
+	vmulps	{rn-sae}, %zmm6, %zmm1, %zmm2
+
+/* sOut = sG2*sR2*(a2+sR2*(a4+a6*sR2)) */
+	vmulps	{rn-sae}, %zmm8, %zmm2, %zmm3
+
+/* sOut = sG1*sinh(dR)+sG2*sR2*(a2+sR2*(a4+a6*sR2)) */
+	vfmadd213ps {rn-sae}, %zmm3, %zmm0, %zmm7
+
+/* sOut = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2)) */
+	vaddps	{rn-sae}, %zmm8, %zmm7, %zmm9
+
+/* Ret H */
+	vorps	%zmm9, %zmm4, %zmm0
+	testl	%edx, %edx
+
+/* Go to special inputs processing branch */
+	jne	L(SPECIAL_VALUES_BRANCH)
+	# LOE rbx r12 r13 r14 r15 edx zmm0 zmm5
+
+/* Restore registers
+ * and exit the function
+ */
+
+L(EXIT):
+	movq	%rbp, %rsp
+	popq	%rbp
+	cfi_def_cfa(7, 8)
+	cfi_restore(6)
+	ret
+	cfi_def_cfa(6, 16)
+	cfi_offset(6, -16)
+
+/* Branch to process
+ * special inputs
+ */
+
+L(SPECIAL_VALUES_BRANCH):
+	vmovups	%zmm5, 64(%rsp)
+	vmovups	%zmm0, 128(%rsp)
+	# LOE rbx r12 r13 r14 r15 edx zmm0
+
+	xorl	%eax, %eax
+	# LOE rbx r12 r13 r14 r15 eax edx
+
+	vzeroupper
+	movq	%r12, 16(%rsp)
+	/* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22
+	movl	%eax, %r12d
+	movq	%r13, 8(%rsp)
+	/* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22
+	movl	%edx, %r13d
+	movq	%r14, (%rsp)
+	/* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22
+	# LOE rbx r15 r12d r13d
+
+/* Range mask
+ * bits check
+ */
+
+L(RANGEMASK_CHECK):
+	btl	%r12d, %r13d
+
+/* Call scalar math function */
+	jc	L(SCALAR_MATH_CALL)
+	# LOE rbx r15 r12d r13d
+
+/* Special inputs
+ * processing loop
+ */
+
+L(SPECIAL_VALUES_LOOP):
+	incl	%r12d
+	cmpl	$16, %r12d
+
+/* Check bits in range mask */
+	jl	L(RANGEMASK_CHECK)
+	# LOE rbx r15 r12d r13d
+
+	movq	16(%rsp), %r12
+	cfi_restore(12)
+	movq	8(%rsp), %r13
+	cfi_restore(13)
+	movq	(%rsp), %r14
+	cfi_restore(14)
+	vmovups	128(%rsp), %zmm0
+
+/* Go to exit */
+	jmp	L(EXIT)
+	/* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22
+	/* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22
+	/* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22
+	# LOE rbx r12 r13 r14 r15 zmm0
+
+/* Scalar math function call
+ * to process special input
+ */
+
+L(SCALAR_MATH_CALL):
+	movl	%r12d, %r14d
+	movss	64(%rsp,%r14,4), %xmm0
+	call	sinhf@PLT
+	# LOE rbx r14 r15 r12d r13d xmm0
+
+	movss	%xmm0, 128(%rsp,%r14,4)
+
+/* Process special inputs in loop */
+	jmp	L(SPECIAL_VALUES_LOOP)
+	# LOE rbx r15 r12d r13d
+END(_ZGVeN16v_sinhf_skx)
+
+	.section .rodata, "a"
+	.align 64
+
+#ifdef __svml_ssinh_data_internal_typedef
+typedef unsigned int VUINT32;
+typedef struct
+{
+	__declspec(align(64)) VUINT32 _sInvLn2[16][1];
+	__declspec(align(64)) VUINT32 _sLn2hi[16][1];
+	__declspec(align(64)) VUINT32 _sLn2lo[16][1];
+	__declspec(align(64)) VUINT32 _sSign[16][1];
+	__declspec(align(64)) VUINT32 _sShifter[16][1];
+	__declspec(align(64)) VUINT32 _iDomainRange[16][1];
+	__declspec(align(64)) VUINT32 _sPC1[16][1];
+	__declspec(align(64)) VUINT32 _sPC2[16][1];
+	__declspec(align(64)) VUINT32 _sPC3[16][1];
+	__declspec(align(64)) VUINT32 _sPC4[16][1];
+	__declspec(align(64)) VUINT32 _sPC5[16][1];
+	__declspec(align(64)) VUINT32 _sPC6[16][1];
+	__declspec(align(64)) VUINT32 _iHalf[16][1];
+} __svml_ssinh_data_internal;
+#endif
+__svml_ssinh_data_internal:
+	.long 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B /* _sInvLn2 */ //k=0
+	.align 64
+	.long 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000 /* _sLn2hi */
+	.align 64
+	.long 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4 /* _sLn2lo */
+	.align 64
+	.long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSign */
+	.align 64
+	.long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */
+	.align 64
+	.long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange */
+	.align 64
+	.long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1=1 */
+	.align 64
+	.long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sPC2 */
+	.align 64
+	.long 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57 /* _sPC3 */
+	.align 64
+	.long 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72 /* _sPC4 */
+	.align 64
+	.long 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461 /* _sPC5 */
+	.align 64
+	.long 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3 /* _sPC6 */
+	// Integer constants
+	.align 64
+	.long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _iHalf */
+	.align 64
+	.type	__svml_ssinh_data_internal, @object
+	.size	__svml_ssinh_data_internal, .-__svml_ssinh_data_internal
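For review purposes, the reduction in the ALGORITHM DESCRIPTION above can be checked in scalar Python. This is a sketch, not part of the patch: `sinh_reduced` is a hypothetical helper name, and it calls `math.sinh`/`math.cosh` on the reduced argument where the kernel instead uses the `_sPC1`..`_sPC6` minimax polynomials.

```python
import math

def sinh_reduced(x):
    # Hypothetical scalar model of the vector kernel's range reduction:
    # with N = round(x/ln2) and r = x - N*ln2,
    #   sinh(x) = sinh(N*ln2 + r) = G1*sinh(r) + G2*cosh(r)
    # where G1 = 2^(N-1) + 2^(-N-1) = cosh(N*ln2)  (sG1 in the code)
    #   and G2 = 2^(N-1) - 2^(-N-1) = sinh(N*ln2)  (sG2 in the code).
    sign = math.copysign(1.0, x)
    x = abs(x)                         # kernel works on |x|, sign restored at the end
    n = round(x / math.log(2.0))       # sN, obtained via the shifter trick in asm
    r = x - n * math.log(2.0)          # reduced argument, |r| <= ln2/2
    g1 = 2.0 ** (n - 1) + 2.0 ** (-n - 1)
    g2 = 2.0 ** (n - 1) - 2.0 ** (-n - 1)
    # The kernel approximates sinh(r) and cosh(r) by short polynomials;
    # the library calls below stand in for those polynomials.
    return sign * (g1 * math.sinh(r) + g2 * math.cosh(r))
```

The point of the reduction is that `g1` and `g2` come from a cheap exponent manipulation (`pslld $23` plus add/sub against `_iHalf`), leaving only a tiny `r` for the polynomial part.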
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core-sse2.S new file mode 100644 index 0000000000..1b31095fe1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized sinhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_sinhf _ZGVbN4v_sinhf_sse2 +#include "../svml_s_sinhf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core.c new file mode 100644 index 0000000000..9d4297c2c9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized sinhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_sinhf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_sinhf, __GI__ZGVbN4v_sinhf, + __redirect__ZGVbN4v_sinhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core_sse4.S new file mode 100644 index 0000000000..82d6f55d33 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf4_core_sse4.S @@ -0,0 +1,308 @@ +/* Function sinhf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *   Compute sinh(x) as (exp(x)-exp(-x))/2,
+ *   where exp is calculated as
+ *   exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r)
+ *
+ *   Special cases:
+ *
+ *   sinh(NaN) = quiet NaN, and raise invalid exception
+ *   sinh(INF) = that INF
+ *   sinh(x) = x for subnormals
+ *   sinh(x) overflows for big x and returns MAXLOG+log(2)
+ *
+ */
+
+/* Offsets for data table __svml_ssinh_data_internal
+ */
+#define _sInvLn2 0
+#define _sLn2hi 16
+#define _sLn2lo 32
+#define _sSign 48
+#define _sShifter 64
+#define _iDomainRange 80
+#define _sPC1 96
+#define _sPC2 112
+#define _sPC3 128
+#define _sPC4 144
+#define _sPC5 160
+#define _sPC6 176
+#define _iHalf 192
+
+#include <sysdep.h>
+
+	.text
+	.section .text.sse4,"ax",@progbits
+ENTRY(_ZGVbN4v_sinhf_sse4)
+	subq	$72, %rsp
+	cfi_def_cfa_offset(80)
+
+/*
+ * Implementation
+ * Abs argument
+ */
+	movups	_sSign+__svml_ssinh_data_internal(%rip), %xmm14
+	andps	%xmm0, %xmm14
+	movaps	%xmm14, %xmm10
+
+/*
+ * Load argument
+ * dM = x/log(2) + RShifter
+ */
+	movups	_sInvLn2+__svml_ssinh_data_internal(%rip), %xmm7
+	pxor	%xmm0, %xmm10
+	mulps	%xmm10, %xmm7
+
+/*
+ * Check for overflow/underflow
+ * Is this faster than GE?
+ */
+	movaps	%xmm10, %xmm1
+	movups	_sShifter+__svml_ssinh_data_internal(%rip), %xmm2
+
+/* sR = sX - sN*Log2_hi */
+	movups	_sLn2hi+__svml_ssinh_data_internal(%rip), %xmm3
+	addps	%xmm2, %xmm7
+
+/*
+ * R
+ * sN = sM - RShifter
+ */
+	movaps	%xmm7, %xmm4
+
+/*
+ * G1,G2 2^N,2^(-N)
+ * iM now is an EXP(2^N)
+ */
+	pslld	$23, %xmm7
+
+/* sR = (sX - sN*Log2_hi) - sN*Log2_lo */
+	movups	_sLn2lo+__svml_ssinh_data_internal(%rip), %xmm5
+	subps	%xmm2, %xmm4
+	mulps	%xmm4, %xmm3
+	mulps	%xmm4, %xmm5
+	subps	%xmm3, %xmm10
+
+/*
+ * sinh(r) = r*(a1 + r^2*(a3 + r^2*(a5 + r^2*a7))) = r + r*(r^2*(a3 + r^2*(a5 + r^2*a7))), a1 = 1
+ * sSinh_r = (a3+r^2*a5) + */ + movups _sPC5+__svml_ssinh_data_internal(%rip), %xmm8 + subps %xmm5, %xmm10 + +/* sR2 = sR^2 */ + movaps %xmm10, %xmm12 + mulps %xmm10, %xmm12 + +/* + * sinh(X) = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2) + * sOut = (a4 +a6*sR2) + */ + movups _sPC6+__svml_ssinh_data_internal(%rip), %xmm9 + mulps %xmm12, %xmm8 + mulps %xmm12, %xmm9 + addps _sPC3+__svml_ssinh_data_internal(%rip), %xmm8 + addps _sPC4+__svml_ssinh_data_internal(%rip), %xmm9 + +/* sSinh_r = r^2*(a3+r^2*a5) */ + mulps %xmm12, %xmm8 + +/* sOut = a2+sR2*(a4+a6*sR2) */ + mulps %xmm12, %xmm9 + +/* sSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + mulps %xmm10, %xmm8 + addps _sPC2+__svml_ssinh_data_internal(%rip), %xmm9 + addps %xmm8, %xmm10 + +/* sOut = sR2*(a2+sR2*(a4+a6*sR2) */ + mulps %xmm9, %xmm12 + movdqu _iHalf+__svml_ssinh_data_internal(%rip), %xmm6 + movdqa %xmm6, %xmm13 + psubd %xmm7, %xmm6 + paddd %xmm7, %xmm13 + +/* sG1 = 2^(N-1)+2^(-N-1) */ + movdqa %xmm13, %xmm11 + +/* sG2 = 2^(N-1)-2^(-N-1) */ + subps %xmm6, %xmm13 + addps %xmm6, %xmm11 + +/* sOut = sG2*sR2*(a2+sR2*(a4+a6*sR2) */ + mulps %xmm13, %xmm12 + +/* sOut = sG1*sinh(dR)+sG2*sR2*(a2+sR2*(a4+a6*sR2) */ + mulps %xmm10, %xmm11 + pcmpgtd _iDomainRange+__svml_ssinh_data_internal(%rip), %xmm1 + addps %xmm11, %xmm12 + movmskps %xmm1, %edx + +/* sOut = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2) */ + addps %xmm12, %xmm13 + +/* Ret H */ + orps %xmm13, %xmm14 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm14 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm14, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm14, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 
8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm14 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm14 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call sinhf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_sinhf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_ssinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _sInvLn2[4][1]; + __declspec(align(16)) VUINT32 _sLn2hi[4][1]; + __declspec(align(16)) VUINT32 _sLn2lo[4][1]; + __declspec(align(16)) VUINT32 _sSign[4][1]; + __declspec(align(16)) VUINT32 _sShifter[4][1]; + __declspec(align(16)) VUINT32 _iDomainRange[4][1]; + __declspec(align(16)) VUINT32 _sPC1[4][1]; + __declspec(align(16)) VUINT32 _sPC2[4][1]; + __declspec(align(16)) VUINT32 _sPC3[4][1]; + __declspec(align(16)) VUINT32 _sPC4[4][1]; + __declspec(align(16)) VUINT32 _sPC5[4][1]; + __declspec(align(16)) VUINT32 _sPC6[4][1]; + __declspec(align(16)) VUINT32 _iHalf[4][1]; +} __svml_ssinh_data_internal; +#endif +__svml_ssinh_data_internal: + .long 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B /* _sInvLn2 */ 
//k=0 + .align 16 + .long 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000 /* _sLn2hi */ + .align 16 + .long 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4 /* _sLn2lo */ + .align 16 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSign */ + .align 16 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 16 + .long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange */ + .align 16 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1=1 */ + .align 16 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sPC2 */ + .align 16 + .long 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57 /* _sPC3 */ + .align 16 + .long 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72 /* _sPC4 */ + .align 16 + .long 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461 /* _sPC5 */ + .align 16 + .long 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3 /* _sPC6 */ + // Integer constants + .align 16 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _iHalf*/ + .align 16 + .type __svml_ssinh_data_internal,@object + .size __svml_ssinh_data_internal,.-__svml_ssinh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core-sse.S new file mode 100644 index 0000000000..d3c9c607a0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized sinhf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN8v_sinhf _ZGVdN8v_sinhf_sse_wrapper +#include "../svml_s_sinhf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core.c new file mode 100644 index 0000000000..2a2e21e742 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized sinhf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN8v_sinhf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_sinhf, __GI__ZGVdN8v_sinhf, + __redirect__ZGVdN8v_sinhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core_avx2.S new file mode 100644 index 0000000000..ea13fb60d4 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_sinhf8_core_avx2.S @@ -0,0 +1,309 @@ +/* Function sinhf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute sinh(x) as (exp(x)-exp(-x))/2, + * where exp is calculated as + * exp(M*ln2 + ln2*(j/2^k) + r) = 2^M * 2^(j/2^k) * exp(r) + * + * Special cases: + * + * sinh(NaN) = quiet NaN, and raise invalid exception + * sinh(INF) = that INF + * sinh(x) = x for subnormals + * sinh(x) overflows for big x and returns MAXLOG+log(2) + * + */ + +/* Offsets for data table __svml_ssinh_data_internal + */ +#define _sInvLn2 0 +#define _sLn2hi 32 +#define _sLn2lo 64 +#define _sSign 96 +#define _sShifter 128 +#define _iDomainRange 160 +#define _sPC1 192 +#define _sPC2 224 +#define _sPC3 256 +#define _sPC4 288 +#define _sPC5 320 +#define _sPC6 352 +#define _iHalf 384 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_sinhf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovups _sInvLn2+__svml_ssinh_data_internal(%rip), %ymm7 + vmovups _sShifter+__svml_ssinh_data_internal(%rip), %ymm4 + vmovups _sLn2hi+__svml_ssinh_data_internal(%rip), %ymm5 + +/* + * sinh(X) = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2) + * sOut = (a4 +a6*sR2) + */ + vmovups _sPC6+__svml_ssinh_data_internal(%rip), %ymm14 + +/* + * sinh(r) = 
r*((a1=1)+r^2*(a3+r^2*(a5+{v1 r^2*a7})))) = r + r*(r^2*(a3+r^2*(a5+r^2*a7))) .... + * sSinh_r = (a3+r^2*a5) + */ + vmovups _sPC5+__svml_ssinh_data_internal(%rip), %ymm12 + vmovups _iHalf+__svml_ssinh_data_internal(%rip), %ymm8 + vmovaps %ymm0, %ymm2 + +/* + * Implementation + * Abs argument + */ + vandps _sSign+__svml_ssinh_data_internal(%rip), %ymm2, %ymm1 + vxorps %ymm2, %ymm1, %ymm0 + +/* + * Load argument + * dM = x/log(2) + RShifter + */ + vfmadd213ps %ymm4, %ymm0, %ymm7 + +/* + * R + * sN = sM - RShifter + */ + vsubps %ymm4, %ymm7, %ymm6 + +/* + * G1,G2 2^N,2^(-N) + * iM now is an EXP(2^N) + */ + vpslld $23, %ymm7, %ymm9 + +/* + * Check for overflow\underflow + * MORE faster than GE? + */ + vpcmpgtd _iDomainRange+__svml_ssinh_data_internal(%rip), %ymm0, %ymm3 + +/* sR = sX - sN*Log2_hi */ + vfnmadd231ps %ymm5, %ymm6, %ymm0 + vpaddd %ymm9, %ymm8, %ymm10 + vpsubd %ymm9, %ymm8, %ymm11 + +/* sR = (sX - sN*Log2_hi) - sN*Log2_lo */ + vfnmadd231ps _sLn2lo+__svml_ssinh_data_internal(%rip), %ymm6, %ymm0 + +/* sR2 = sR^2 */ + vmulps %ymm0, %ymm0, %ymm13 + vfmadd213ps _sPC4+__svml_ssinh_data_internal(%rip), %ymm13, %ymm14 + vfmadd213ps _sPC3+__svml_ssinh_data_internal(%rip), %ymm13, %ymm12 + +/* sOut = a2+sR2*(a4+a6*sR2) */ + vfmadd213ps _sPC2+__svml_ssinh_data_internal(%rip), %ymm13, %ymm14 + +/* sSinh_r = r^2*(a3+r^2*a5) */ + vmulps %ymm12, %ymm13, %ymm12 + +/* sOut = sR2*(a2+sR2*(a4+a6*sR2) */ + vmulps %ymm14, %ymm13, %ymm15 + +/* sSinh_r = r + r*(r^2*(a3+r^2*a5)) */ + vfmadd213ps %ymm0, %ymm0, %ymm12 + vmovmskps %ymm3, %edx + +/* sG1 = 2^(N-1)+2^(-N-1) */ + vaddps %ymm11, %ymm10, %ymm3 + +/* sG2 = 2^(N-1)-2^(-N-1) */ + vsubps %ymm11, %ymm10, %ymm10 + +/* sOut = sG2*sR2*(a2+sR2*(a4+a6*sR2) */ + vmulps %ymm15, %ymm10, %ymm0 + +/* sOut = sG1*sinh(dR)+sG2*sR2*(a2+sR2*(a4+a6*sR2) */ + vfmadd213ps %ymm0, %ymm12, %ymm3 + +/* sOut = sG2 + sG1*sinh(dR) + sG2*sR2*(a2+sR2*(a4+a6*sR2) */ + vaddps %ymm3, %ymm10, %ymm4 + +/* Ret H */ + vorps %ymm4, %ymm1, %ymm0 + testl %edx, %edx 
+ +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm2 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm2, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: 
-32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call sinhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_sinhf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_ssinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _sInvLn2[8][1]; + __declspec(align(32)) VUINT32 _sLn2hi[8][1]; + __declspec(align(32)) VUINT32 _sLn2lo[8][1]; + __declspec(align(32)) VUINT32 _sSign[8][1]; + __declspec(align(32)) VUINT32 _sShifter[8][1]; + __declspec(align(32)) VUINT32 _iDomainRange[8][1]; + __declspec(align(32)) VUINT32 _sPC1[8][1]; + __declspec(align(32)) VUINT32 _sPC2[8][1]; + __declspec(align(32)) VUINT32 _sPC3[8][1]; + __declspec(align(32)) VUINT32 _sPC4[8][1]; + __declspec(align(32)) VUINT32 _sPC5[8][1]; + __declspec(align(32)) VUINT32 _sPC6[8][1]; + __declspec(align(32)) VUINT32 _iHalf[8][1]; +} __svml_ssinh_data_internal; +#endif +__svml_ssinh_data_internal: + .long 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B, 0x3FB8AA3B /* _sInvLn2 */ //k=0 + .align 32 + .long 0x3F317000, 
0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000, 0x3F317000 /* _sLn2hi */ + .align 32 + .long 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4, 0x3805FDF4 /* _sLn2lo */ + .align 32 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSign */ + .align 32 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 /* _sShifter */ + .align 32 + .long 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E, 0x42AEAC4E /* _iDomainRange */ + .align 32 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 /* _sPC1=1 */ + .align 32 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _sPC2 */ + .align 32 + .long 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57, 0x3e2aaa57 /* _sPC3 */ + .align 32 + .long 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72, 0x3d2aaa72 /* _sPC4 */ + .align 32 + .long 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461, 0x3c091461 /* _sPC5 */ + .align 32 + .long 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3, 0x3ab6a8a3 /* _sPC6 */ + // Integer constants + .align 32 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 /* _iHalf*/ + .align 32 + .type __svml_ssinh_data_internal,@object + .size __svml_ssinh_data_internal,.-__svml_ssinh_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_sinh2_core.S b/sysdeps/x86_64/fpu/svml_d_sinh2_core.S new file mode 100644 index 0000000000..91bda7318c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_sinh2_core.S @@ -0,0 +1,29 @@ +/* Function sinh vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_sinh) +WRAPPER_IMPL_SSE2 sinh +END (_ZGVbN2v_sinh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_sinh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_sinh4_core.S b/sysdeps/x86_64/fpu/svml_d_sinh4_core.S new file mode 100644 index 0000000000..7b8091946a --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_sinh4_core.S @@ -0,0 +1,29 @@ +/* Function sinh vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_sinh) +WRAPPER_IMPL_AVX _ZGVbN2v_sinh +END (_ZGVdN4v_sinh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_sinh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_sinh4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_sinh4_core_avx.S new file mode 100644 index 0000000000..f773bf110c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_sinh4_core_avx.S @@ -0,0 +1,25 @@ +/* Function sinh vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_sinh) +WRAPPER_IMPL_AVX _ZGVbN2v_sinh +END (_ZGVcN4v_sinh) diff --git a/sysdeps/x86_64/fpu/svml_d_sinh8_core.S b/sysdeps/x86_64/fpu/svml_d_sinh8_core.S new file mode 100644 index 0000000000..153a18429c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_sinh8_core.S @@ -0,0 +1,25 @@ +/* Function sinh vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_sinh) +WRAPPER_IMPL_AVX512 _ZGVdN4v_sinh +END (_ZGVeN8v_sinh) diff --git a/sysdeps/x86_64/fpu/svml_s_sinhf16_core.S b/sysdeps/x86_64/fpu/svml_s_sinhf16_core.S new file mode 100644 index 0000000000..f8dc7da336 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_sinhf16_core.S @@ -0,0 +1,25 @@ +/* Function sinhf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_sinhf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_sinhf +END (_ZGVeN16v_sinhf) diff --git a/sysdeps/x86_64/fpu/svml_s_sinhf4_core.S b/sysdeps/x86_64/fpu/svml_s_sinhf4_core.S new file mode 100644 index 0000000000..d065d03eb6 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_sinhf4_core.S @@ -0,0 +1,29 @@ +/* Function sinhf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_sinhf) +WRAPPER_IMPL_SSE2 sinhf +END (_ZGVbN4v_sinhf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_sinhf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_sinhf8_core.S b/sysdeps/x86_64/fpu/svml_s_sinhf8_core.S new file mode 100644 index 0000000000..1194699a76 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_sinhf8_core.S @@ -0,0 +1,29 @@ +/* Function sinhf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_sinhf) +WRAPPER_IMPL_AVX _ZGVbN4v_sinhf +END (_ZGVdN8v_sinhf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_sinhf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_sinhf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_sinhf8_core_avx.S new file mode 100644 index 0000000000..82c6b9b239 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_sinhf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function sinhf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_sinhf) +WRAPPER_IMPL_AVX _ZGVbN4v_sinhf +END (_ZGVcN8v_sinhf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx.c new file mode 100644 index 0000000000..55aa36d866 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-sinh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx2.c new file mode 100644 index 0000000000..55aa36d866 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-sinh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx512f.c new file mode 100644 index 0000000000..55aa36d866 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-sinh-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-sinh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-sinh.c b/sysdeps/x86_64/fpu/test-double-libmvec-sinh.c new file mode 100644 index 0000000000..82dcaf745d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-sinh.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC sinh +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 0222f9f5b8..db136cc901 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVbN2v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVbN2v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVbN2v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVbN2v_expm1) +VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVbN2v_sinh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c 
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 1aad9faf9c..5fc09ac8c0 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVdN4v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVdN4v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVdN4v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVdN4v_expm1) +VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVdN4v_sinh) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index e404bf899d..26ef7fb365 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVcN4v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVcN4v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVcN4v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVcN4v_expm1) +VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVcN4v_sinh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 2b4de59343..c7055fca76 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2), _ZGVeN8v_exp2) VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVeN8v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVeN8v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVeN8v_expm1) +VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVeN8v_sinh) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx.c new file mode 100644 index 0000000000..93986945f3 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-sinhf.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx2.c new file mode 100644 index 0000000000..93986945f3 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-sinhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx512f.c new file mode 100644 index 0000000000..93986945f3 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-sinhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-sinhf.c b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf.c new file mode 100644 index 0000000000..fb1f3c5c48 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-sinhf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC sinhf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 9a4a1b84a9..d353bcb0f2 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVeN16v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVeN16v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVeN16v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVeN16v_expm1f) +VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVeN16v_sinhf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index eb4e36d0e2..5e59117626 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVbN4v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVbN4v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVbN4v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVbN4v_expm1f) +VECTOR_WRAPPER 
(WRAPPER_NAME (sinhf), _ZGVbN4v_sinhf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index d8adab59e6..e884a5f4df 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVdN8v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVdN8v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVdN8v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVdN8v_expm1f) +VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVdN8v_sinhf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index e6e1a90c72..95910d39e9 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -35,6 +35,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp2f), _ZGVcN8v_exp2f) VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVcN8v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVcN8v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVcN8v_expm1f) +VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVcN8v_sinhf) #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:21 2021
From: Sunil Pandey
To: libc-alpha@sourceware.org
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Subject: [PATCH v4 09/18] x86-64: Add vector cbrt/cbrtf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:21 -0800
Message-Id: <20211228201130.737370-10-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>

Implement vectorized cbrt/cbrtf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per the vector ABI. Also add accuracy and ABI tests for vector cbrt/cbrtf, with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_cbrt2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_cbrt2_core.c | 27 + .../fpu/multiarch/svml_d_cbrt2_core_sse4.S | 467 ++++++++++++++++ .../fpu/multiarch/svml_d_cbrt4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_cbrt4_core.c | 27 + .../fpu/multiarch/svml_d_cbrt4_core_avx2.S | 505 +++++++++++++++++ .../fpu/multiarch/svml_d_cbrt8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_cbrt8_core.c | 27 + .../fpu/multiarch/svml_d_cbrt8_core_avx512.S | 253 +++++++++ .../fpu/multiarch/svml_s_cbrtf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_cbrtf16_core.c | 28 + .../multiarch/svml_s_cbrtf16_core_avx512.S | 235 ++++++++ .../fpu/multiarch/svml_s_cbrtf4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_cbrtf4_core.c | 28 + .../fpu/multiarch/svml_s_cbrtf4_core_sse4.S | 490 +++++++++++++++++ .../fpu/multiarch/svml_s_cbrtf8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_cbrtf8_core.c | 28 + .../fpu/multiarch/svml_s_cbrtf8_core_avx2.S | 509 ++++++++++++++++++ sysdeps/x86_64/fpu/svml_d_cbrt2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_cbrt4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_cbrt4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_cbrt8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_cbrtf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_cbrtf4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_cbrtf8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_cbrtf8_core_avx.S | 25 + .../x86_64/fpu/test-double-libmvec-cbrt-avx.c | 1 + .../fpu/test-double-libmvec-cbrt-avx2.c | 1 + .../fpu/test-double-libmvec-cbrt-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-cbrt.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-cbrtf-avx.c | 1 + .../fpu/test-float-libmvec-cbrtf-avx2.c | 1 + .../fpu/test-float-libmvec-cbrtf-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-cbrtf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 3031 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_cbrt2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_cbrt4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_cbrt4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_cbrt8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_cbrtf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_cbrtf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_cbrtf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_cbrtf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-cbrt.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-cbrtf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 6347320521..7f1304ed1d 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -197,4 +197,15 @@ #define __DECL_SIMD_sinhf32x #define __DECL_SIMD_sinhf64x #define __DECL_SIMD_sinhf128x + +#define __DECL_SIMD_cbrt +#define __DECL_SIMD_cbrtf +#define __DECL_SIMD_cbrtl +#define __DECL_SIMD_cbrtf16 +#define __DECL_SIMD_cbrtf32 +#define __DECL_SIMD_cbrtf64 +#define __DECL_SIMD_cbrtf128 +#define __DECL_SIMD_cbrtf32x +#define __DECL_SIMD_cbrtf64x +#define __DECL_SIMD_cbrtf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 673b3a93ba..26d18f0135 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -149,7 +149,7 @@ __MATHCALL_VEC (hypot,, (_Mdouble_ __x, _Mdouble_ __y)); #if defined __USE_XOPEN_EXTENDED || defined __USE_ISOC99 /* Return the cube root of X. 
*/ -__MATHCALL (cbrt,, (_Mdouble_ __x)); +__MATHCALL_VEC (cbrt,, (_Mdouble_ __x)); #endif diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index f9d7b085ab..a6558d9810 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,6 +49,7 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F +GLIBC_2.35 _ZGVbN2v_cbrt F GLIBC_2.35 _ZGVbN2v_cosh F GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F @@ -58,6 +59,7 @@ GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F +GLIBC_2.35 _ZGVbN4v_cbrtf F GLIBC_2.35 _ZGVbN4v_coshf F GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F @@ -67,6 +69,7 @@ GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F +GLIBC_2.35 _ZGVcN4v_cbrt F GLIBC_2.35 _ZGVcN4v_cosh F GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F @@ -76,6 +79,7 @@ GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F +GLIBC_2.35 _ZGVcN8v_cbrtf F GLIBC_2.35 _ZGVcN8v_coshf F GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F @@ -85,6 +89,7 @@ GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F +GLIBC_2.35 _ZGVdN4v_cbrt F GLIBC_2.35 _ZGVdN4v_cosh F GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F @@ -94,6 +99,7 @@ GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F +GLIBC_2.35 _ZGVdN8v_cbrtf F GLIBC_2.35 _ZGVdN8v_coshf F GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F @@ -103,6 +109,7 @@ GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F +GLIBC_2.35 _ZGVeN16v_cbrtf F GLIBC_2.35 _ZGVeN16v_coshf F 
GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F @@ -112,6 +119,7 @@ GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F +GLIBC_2.35 _ZGVeN8v_cbrt F GLIBC_2.35 _ZGVeN8v_cosh F GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 51a41cfebc..dcd45934ab 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -94,6 +94,10 @@ # define __DECL_SIMD_sinh __DECL_SIMD_x86_64 # undef __DECL_SIMD_sinhf # define __DECL_SIMD_sinhf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_cbrt +# define __DECL_SIMD_cbrt __DECL_SIMD_x86_64 +# undef __DECL_SIMD_cbrtf +# define __DECL_SIMD_cbrtf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 91e9b4fc83..dfb5f13ea3 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -46,6 +46,8 @@ !GCC$ builtin (expm1f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (sinh) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (sinhf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (cbrt) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (cbrtf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -77,3 +79,5 @@ !GCC$ builtin (expm1f) attributes simd (notinbranch) if('x32') !GCC$ builtin (sinh) attributes simd (notinbranch) if('x32') !GCC$ builtin (sinhf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (cbrt) attributes simd (notinbranch) if('x32') +!GCC$ builtin (cbrtf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 81e9fc95b2..dde737c0d6 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ 
b/sysdeps/x86_64/fpu/Makeconfig @@ -25,6 +25,7 @@ libmvec-funcs = \ acos \ asin \ atan \ + cbrt \ cos \ cosh \ exp \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 2710446d12..b70aeb3e2f 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,6 +17,7 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; + _ZGVbN2v_cbrt; _ZGVcN4v_cbrt; _ZGVdN4v_cbrt; _ZGVeN8v_cbrt; _ZGVbN2v_cosh; _ZGVcN4v_cosh; _ZGVdN4v_cosh; _ZGVeN8v_cosh; _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; @@ -26,6 +27,7 @@ libmvec { _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; + _ZGVbN4v_cbrtf; _ZGVcN8v_cbrtf; _ZGVdN8v_cbrtf; _ZGVeN16v_cbrtf; _ZGVbN4v_coshf; _ZGVcN8v_coshf; _ZGVdN8v_coshf; _ZGVeN16v_coshf; _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index f4b313119d..e039a993df 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -583,6 +583,26 @@ float: 1 float128: 1 ldouble: 1 +Function: "cbrt_vlen16": +float: 1 + +Function: "cbrt_vlen2": +double: 1 + +Function: "cbrt_vlen4": +double: 1 +float: 2 + +Function: "cbrt_vlen4_avx2": +double: 1 + +Function: "cbrt_vlen8": +double: 1 +float: 2 + +Function: "cbrt_vlen8_avx2": +float: 2 + Function: Real part of "ccos": double: 1 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core-sse2.S new file mode 100644 index 0000000000..60f4c46a11 --- /dev/null +++ 
b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized cbrt, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVbN2v_cbrt _ZGVbN2v_cbrt_sse2 +#include "../svml_d_cbrt2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core.c new file mode 100644 index 0000000000..07390b7150 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized cbrt, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVbN2v_cbrt +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_cbrt, __GI__ZGVbN2v_cbrt, __redirect__ZGVbN2v_cbrt) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core_sse4.S new file mode 100644 index 0000000000..72ecb25e05 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt2_core_sse4.S @@ -0,0 +1,467 @@ +/* Function cbrt vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * x=2^{3*k+j} * 1.b1 b2 ... b5 b6 ... b52 + * Let r=(x*2^{-3k-j} - 1.b1 b2 ... b5 1)* rcp[b1 b2 ..b5], + * where rcp[b1 b2 .. b5]=1/(1.b1 b2 b3 b4 b5 1) in double precision + * cbrt(2^j * 1. b1 b2 .. 
b5 1) is approximated as T[j][b1..b5]+D[j][b1..b5] + (T stores the high 53 bits, D stores the low order bits) + Result=2^k*T+(2^k*T*r)*P+2^k*D + where P=p1+p2*r+..+p8*r^7 + * + */ + +/* Offsets for data table __svml_dcbrt_data_internal + */ +#define _dRcp 0 +#define _dCbrtHiLo 256 +#define _dA7 1024 +#define _dA6 1040 +#define _dA5 1056 +#define _dA4 1072 +#define _dA3 1088 +#define _dA2 1104 +#define _dA1 1120 +#define _dNeg65Div64 1136 +#define _dSgnf6Mask 1152 +#define _dNegOne 1168 +#define _dMantissaMask 1184 +#define _lExpHiMask 1200 +#define _lExpLoMask 1216 +#define _l1556 1232 +#define _iRcpIndexMask 1248 +#define _iAbsMask 1264 +#define _iSignMask 1280 +#define _iBias 1296 +#define _iSub 1312 +#define _iCmp 1328 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_cbrt_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* Calculate CbrtIndex */ + movaps %xmm0, %xmm10 + psrlq $52, %xmm10 + +/* Load 1/(1+iRcpIndex/32+1/64) reciprocal table value */ + lea __svml_dcbrt_data_internal(%rip), %r8 + pand _lExpLoMask+__svml_dcbrt_data_internal(%rip), %xmm10 + movdqu _l1556+__svml_dcbrt_data_internal(%rip), %xmm9 + pmuludq %xmm10, %xmm9 + +/* If the exponent field is zero - go to callout to process denormals */ + movq _iAbsMask+__svml_dcbrt_data_internal(%rip), %xmm7 + +/* Calculate Rcp table index */ + movq _iRcpIndexMask+__svml_dcbrt_data_internal(%rip), %xmm13 + +/* Get iX - high part of argument */ + pshufd $221, %xmm0, %xmm4 + +/* + * Declarations + * Load constants + */ + movq _iSignMask+__svml_dcbrt_data_internal(%rip), %xmm1 + pand %xmm4, %xmm7 + pand %xmm4, %xmm13 + +/* Compute 2^k */ + psrld $20, %xmm4 + movq _iBias+__svml_dcbrt_data_internal(%rip), %xmm2 + pand %xmm1, %xmm4 + pshufd $136, %xmm9, %xmm15 + por %xmm2, %xmm4 + psrld $14, %xmm15 + psrld $12, %xmm13 + paddd %xmm15, %xmm4 + pxor %xmm2, %xmm2 + pslld $20, %xmm4 + movdqa %xmm15, %xmm11 + movd %xmm13, %edx + paddd %xmm15, %xmm11 + pshufd $1, %xmm13, %xmm8 + punpckldq
%xmm4, %xmm2 + +/* + * VAND( L, l2k, = l2k, lExpHiMask ); + * Argument reduction Z + */ + movups _dMantissaMask+__svml_dcbrt_data_internal(%rip), %xmm1 + movups _dSgnf6Mask+__svml_dcbrt_data_internal(%rip), %xmm4 + andps %xmm0, %xmm1 + movd %xmm8, %ecx + andps %xmm0, %xmm4 + orps _dNegOne+__svml_dcbrt_data_internal(%rip), %xmm1 + orps _dNeg65Div64+__svml_dcbrt_data_internal(%rip), %xmm4 + movslq %edx, %rdx + subpd %xmm4, %xmm1 + movslq %ecx, %rcx + movsd (%r8,%rdx), %xmm3 + movq _iSub+__svml_dcbrt_data_internal(%rip), %xmm5 + psubd %xmm5, %xmm7 + movhpd (%r8,%rcx), %xmm3 + mulpd %xmm1, %xmm3 + +/* Polynomial */ + movups _dA7+__svml_dcbrt_data_internal(%rip), %xmm5 + mulpd %xmm3, %xmm5 + addpd _dA6+__svml_dcbrt_data_internal(%rip), %xmm5 + mulpd %xmm3, %xmm5 + addpd _dA5+__svml_dcbrt_data_internal(%rip), %xmm5 + mulpd %xmm3, %xmm5 + addpd _dA4+__svml_dcbrt_data_internal(%rip), %xmm5 + mulpd %xmm3, %xmm5 + addpd _dA3+__svml_dcbrt_data_internal(%rip), %xmm5 + pshufd $136, %xmm10, %xmm12 + psubd %xmm15, %xmm12 + psubd %xmm11, %xmm12 + mulpd %xmm3, %xmm5 + pslld $8, %xmm12 + paddd %xmm12, %xmm13 + +/* Load cbrt(2^j*(1+iRcpIndex/32+1/64)) Hi & Lo values */ + movd %xmm13, %esi + pshufd $1, %xmm13, %xmm14 + movq _iCmp+__svml_dcbrt_data_internal(%rip), %xmm6 + movd %xmm14, %edi + pcmpgtd %xmm6, %xmm7 + movmskps %xmm7, %eax + addpd _dA2+__svml_dcbrt_data_internal(%rip), %xmm5 + movslq %esi, %rsi + movslq %edi, %rdi + mulpd %xmm3, %xmm5 + movsd 256(%r8,%rsi), %xmm6 + movhpd 256(%r8,%rdi), %xmm6 + +/* THi*2^k, TLo*2^k */ + mulpd %xmm2, %xmm6 + addpd _dA1+__svml_dcbrt_data_internal(%rip), %xmm5 + +/* THi*2^k*Z */ + mulpd %xmm6, %xmm3 + +/* Final reconstruction */ + mulpd %xmm3, %xmm5 + addpd %xmm5, %xmm6 + andl $3, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax xmm0 xmm6 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm6, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + 
cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm6, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax xmm6 + + xorl %edx, %edx + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %edx, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %eax, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm6 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm6 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call cbrt@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_cbrt_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dcbrt_data_internal_typedef typedef unsigned int VUINT32; typedef struct +{ + __declspec(align(16)) VUINT32 _dRcp[32][2]; + __declspec(align(16)) VUINT32 _dCbrtHiLo[96][2]; + __declspec(align(16)) VUINT32 _dA7[2][2]; + __declspec(align(16)) VUINT32 _dA6[2][2]; + __declspec(align(16)) VUINT32 _dA5[2][2]; + __declspec(align(16)) VUINT32 _dA4[2][2]; + __declspec(align(16)) VUINT32 _dA3[2][2]; + __declspec(align(16)) VUINT32 _dA2[2][2]; + __declspec(align(16)) VUINT32 _dA1[2][2]; + __declspec(align(16)) VUINT32 _dNeg65Div64[2][2]; +
__declspec(align(16)) VUINT32 _dSgnf6Mask[2][2]; + __declspec(align(16)) VUINT32 _dNegOne[2][2]; + __declspec(align(16)) VUINT32 _dMantissaMask[2][2]; + __declspec(align(16)) VUINT32 _lExpHiMask[2][2]; + __declspec(align(16)) VUINT32 _lExpLoMask[2][2]; + __declspec(align(16)) VUINT32 _l1556[2][2]; + __declspec(align(16)) VUINT32 _iRcpIndexMask[4][1]; + __declspec(align(16)) VUINT32 _iAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iSignMask[4][1]; + __declspec(align(16)) VUINT32 _iBias[4][1]; + __declspec(align(16)) VUINT32 _iSub[4][1]; + __declspec(align(16)) VUINT32 _iCmp[4][1]; +} __svml_dcbrt_data_internal; +#endif +__svml_dcbrt_data_internal: + /*== _dRcp ==*/ + .quad 0xBFEF81F81F81F820 /* (1/(1+0/32+1/64)) = -.984615 */ + .quad 0xBFEE9131ABF0B767 /* (1/(1+1/32+1/64)) = -.955224 */ + .quad 0xBFEDAE6076B981DB /* (1/(1+2/32+1/64)) = -.927536 */ + .quad 0xBFECD85689039B0B /* (1/(1+3/32+1/64)) = -.901408 */ + .quad 0xBFEC0E070381C0E0 /* (1/(1+4/32+1/64)) = -.876712 */ + .quad 0xBFEB4E81B4E81B4F /* (1/(1+5/32+1/64)) = -.853333 */ + .quad 0xBFEA98EF606A63BE /* (1/(1+6/32+1/64)) = -.831169 */ + .quad 0xBFE9EC8E951033D9 /* (1/(1+7/32+1/64)) = -.810127 */ + .quad 0xBFE948B0FCD6E9E0 /* (1/(1+8/32+1/64)) = -.790123 */ + .quad 0xBFE8ACB90F6BF3AA /* (1/(1+9/32+1/64)) = -.771084 */ + .quad 0xBFE8181818181818 /* (1/(1+10/32+1/64)) = -.752941 */ + .quad 0xBFE78A4C8178A4C8 /* (1/(1+11/32+1/64)) = -.735632 */ + .quad 0xBFE702E05C0B8170 /* (1/(1+12/32+1/64)) = -.719101 */ + .quad 0xBFE6816816816817 /* (1/(1+13/32+1/64)) = -.703297 */ + .quad 0xBFE6058160581606 /* (1/(1+14/32+1/64)) = -.688172 */ + .quad 0xBFE58ED2308158ED /* (1/(1+15/32+1/64)) = -.673684 */ + .quad 0xBFE51D07EAE2F815 /* (1/(1+16/32+1/64)) = -.659794 */ + .quad 0xBFE4AFD6A052BF5B /* (1/(1+17/32+1/64)) = -.646465 */ + .quad 0xBFE446F86562D9FB /* (1/(1+18/32+1/64)) = -.633663 */ + .quad 0xBFE3E22CBCE4A902 /* (1/(1+19/32+1/64)) = -.621359 */ + .quad 0xBFE3813813813814 /* (1/(1+20/32+1/64)) = -.609524 */ + .quad 
0xBFE323E34A2B10BF /* (1/(1+21/32+1/64)) = -.598131 */ + .quad 0xBFE2C9FB4D812CA0 /* (1/(1+22/32+1/64)) = -.587156 */ + .quad 0xBFE27350B8812735 /* (1/(1+23/32+1/64)) = -.576577 */ + .quad 0xBFE21FB78121FB78 /* (1/(1+24/32+1/64)) = -.566372 */ + .quad 0xBFE1CF06ADA2811D /* (1/(1+25/32+1/64)) = -.556522 */ + .quad 0xBFE1811811811812 /* (1/(1+26/32+1/64)) = -.547009 */ + .quad 0xBFE135C81135C811 /* (1/(1+27/32+1/64)) = -.537815 */ + .quad 0xBFE0ECF56BE69C90 /* (1/(1+28/32+1/64)) = -.528926 */ + .quad 0xBFE0A6810A6810A7 /* (1/(1+29/32+1/64)) = -.520325 */ + .quad 0xBFE0624DD2F1A9FC /* (1/(1+30/32+1/64)) = -.512 */ + .quad 0xBFE0204081020408 /* (1/(1+31/32+1/64)) = -.503937 */ + /*== _dCbrtHiLo ==*/ + .align 16 + .quad 0x3FF01539221D4C97 /* HI((2^0*(1+0/32+1/64))^(1/3)) = 1.005181 */ + .quad 0x3FF03F06771A2E33 /* HI((2^0*(1+1/32+1/64))^(1/3)) = 1.015387 */ + .quad 0x3FF06800E629D671 /* HI((2^0*(1+2/32+1/64))^(1/3)) = 1.025391 */ + .quad 0x3FF090328731DEB2 /* HI((2^0*(1+3/32+1/64))^(1/3)) = 1.035204 */ + .quad 0x3FF0B7A4B1BD64AC /* HI((2^0*(1+4/32+1/64))^(1/3)) = 1.044835 */ + .quad 0x3FF0DE601024FB87 /* HI((2^0*(1+5/32+1/64))^(1/3)) = 1.054291 */ + .quad 0x3FF1046CB0597000 /* HI((2^0*(1+6/32+1/64))^(1/3)) = 1.06358 */ + .quad 0x3FF129D212A9BA9B /* HI((2^0*(1+7/32+1/64))^(1/3)) = 1.07271 */ + .quad 0x3FF14E9736CDAF38 /* HI((2^0*(1+8/32+1/64))^(1/3)) = 1.081687 */ + .quad 0x3FF172C2A772F507 /* HI((2^0*(1+9/32+1/64))^(1/3)) = 1.090518 */ + .quad 0x3FF1965A848001D3 /* HI((2^0*(1+10/32+1/64))^(1/3)) = 1.099207 */ + .quad 0x3FF1B9648C38C55D /* HI((2^0*(1+11/32+1/64))^(1/3)) = 1.107762 */ + .quad 0x3FF1DBE6236A0C45 /* HI((2^0*(1+12/32+1/64))^(1/3)) = 1.116186 */ + .quad 0x3FF1FDE45CBB1F9F /* HI((2^0*(1+13/32+1/64))^(1/3)) = 1.124485 */ + .quad 0x3FF21F63FF409042 /* HI((2^0*(1+14/32+1/64))^(1/3)) = 1.132664 */ + .quad 0x3FF240698C6746E5 /* HI((2^0*(1+15/32+1/64))^(1/3)) = 1.140726 */ + .quad 0x3FF260F9454BB99B /* HI((2^0*(1+16/32+1/64))^(1/3)) = 1.148675 */ + .quad 
0x3FF281172F8E7073 /* HI((2^0*(1+17/32+1/64))^(1/3)) = 1.156516 */ + .quad 0x3FF2A0C719B4B6D0 /* HI((2^0*(1+18/32+1/64))^(1/3)) = 1.164252 */ + .quad 0x3FF2C00C9F2263EC /* HI((2^0*(1+19/32+1/64))^(1/3)) = 1.171887 */ + .quad 0x3FF2DEEB2BB7FB78 /* HI((2^0*(1+20/32+1/64))^(1/3)) = 1.179423 */ + .quad 0x3FF2FD65FF1EFBBC /* HI((2^0*(1+21/32+1/64))^(1/3)) = 1.186865 */ + .quad 0x3FF31B802FCCF6A2 /* HI((2^0*(1+22/32+1/64))^(1/3)) = 1.194214 */ + .quad 0x3FF3393CADC50708 /* HI((2^0*(1+23/32+1/64))^(1/3)) = 1.201474 */ + .quad 0x3FF3569E451E4C2A /* HI((2^0*(1+24/32+1/64))^(1/3)) = 1.208647 */ + .quad 0x3FF373A7A0554CDE /* HI((2^0*(1+25/32+1/64))^(1/3)) = 1.215736 */ + .quad 0x3FF3905B4A6D76CE /* HI((2^0*(1+26/32+1/64))^(1/3)) = 1.222743 */ + .quad 0x3FF3ACBBB0E756B6 /* HI((2^0*(1+27/32+1/64))^(1/3)) = 1.229671 */ + .quad 0x3FF3C8CB258FA340 /* HI((2^0*(1+28/32+1/64))^(1/3)) = 1.236522 */ + .quad 0x3FF3E48BE02AC0CE /* HI((2^0*(1+29/32+1/64))^(1/3)) = 1.243297 */ + .quad 0x3FF4000000000000 /* HI((2^0*(1+30/32+1/64))^(1/3)) = 1.25 */ + .quad 0x3FF41B298D47800E /* HI((2^0*(1+31/32+1/64))^(1/3)) = 1.256631 */ + .quad 0x3FF443604B34D9B2 /* HI((2^1*(1+0/32+1/64))^(1/3)) = 1.266449 */ + .quad 0x3FF4780B20906571 /* HI((2^1*(1+1/32+1/64))^(1/3)) = 1.279307 */ + .quad 0x3FF4ABAC3EE06706 /* HI((2^1*(1+2/32+1/64))^(1/3)) = 1.291912 */ + .quad 0x3FF4DE505DA66B8D /* HI((2^1*(1+3/32+1/64))^(1/3)) = 1.304276 */ + .quad 0x3FF51003420A5C07 /* HI((2^1*(1+4/32+1/64))^(1/3)) = 1.316409 */ + .quad 0x3FF540CFD6FD11C1 /* HI((2^1*(1+5/32+1/64))^(1/3)) = 1.328323 */ + .quad 0x3FF570C04260716B /* HI((2^1*(1+6/32+1/64))^(1/3)) = 1.340027 */ + .quad 0x3FF59FDDF7A45F38 /* HI((2^1*(1+7/32+1/64))^(1/3)) = 1.35153 */ + .quad 0x3FF5CE31C83539DF /* HI((2^1*(1+8/32+1/64))^(1/3)) = 1.36284 */ + .quad 0x3FF5FBC3F20966A4 /* HI((2^1*(1+9/32+1/64))^(1/3)) = 1.373966 */ + .quad 0x3FF6289C2C8F1B70 /* HI((2^1*(1+10/32+1/64))^(1/3)) = 1.384915 */ + .quad 0x3FF654C1B4316DCF /* HI((2^1*(1+11/32+1/64))^(1/3)) = 1.395693 
*/ + .quad 0x3FF6803B54A34E44 /* HI((2^1*(1+12/32+1/64))^(1/3)) = 1.406307 */ + .quad 0x3FF6AB0F72182659 /* HI((2^1*(1+13/32+1/64))^(1/3)) = 1.416763 */ + .quad 0x3FF6D544118C08BC /* HI((2^1*(1+14/32+1/64))^(1/3)) = 1.427067 */ + .quad 0x3FF6FEDEE0388D4A /* HI((2^1*(1+15/32+1/64))^(1/3)) = 1.437224 */ + .quad 0x3FF727E53A4F645E /* HI((2^1*(1+16/32+1/64))^(1/3)) = 1.44724 */ + .quad 0x3FF7505C31104114 /* HI((2^1*(1+17/32+1/64))^(1/3)) = 1.457119 */ + .quad 0x3FF77848904CD549 /* HI((2^1*(1+18/32+1/64))^(1/3)) = 1.466866 */ + .quad 0x3FF79FAEE36B2534 /* HI((2^1*(1+19/32+1/64))^(1/3)) = 1.476485 */ + .quad 0x3FF7C69379F4605B /* HI((2^1*(1+20/32+1/64))^(1/3)) = 1.48598 */ + .quad 0x3FF7ECFA6BBCA391 /* HI((2^1*(1+21/32+1/64))^(1/3)) = 1.495356 */ + .quad 0x3FF812E79CAE7EB9 /* HI((2^1*(1+22/32+1/64))^(1/3)) = 1.504615 */ + .quad 0x3FF8385EC043C71D /* HI((2^1*(1+23/32+1/64))^(1/3)) = 1.513762 */ + .quad 0x3FF85D635CB41B9D /* HI((2^1*(1+24/32+1/64))^(1/3)) = 1.5228 */ + .quad 0x3FF881F8CDE083DB /* HI((2^1*(1+25/32+1/64))^(1/3)) = 1.531731 */ + .quad 0x3FF8A6224802B8A8 /* HI((2^1*(1+26/32+1/64))^(1/3)) = 1.54056 */ + .quad 0x3FF8C9E2DA25E5E4 /* HI((2^1*(1+27/32+1/64))^(1/3)) = 1.549289 */ + .quad 0x3FF8ED3D706E1010 /* HI((2^1*(1+28/32+1/64))^(1/3)) = 1.55792 */ + .quad 0x3FF91034D632B6DF /* HI((2^1*(1+29/32+1/64))^(1/3)) = 1.566457 */ + .quad 0x3FF932CBB7F0CF2D /* HI((2^1*(1+30/32+1/64))^(1/3)) = 1.574901 */ + .quad 0x3FF95504A517BF3A /* HI((2^1*(1+31/32+1/64))^(1/3)) = 1.583256 */ + .quad 0x3FF987AF34F8BB19 /* HI((2^2*(1+0/32+1/64))^(1/3)) = 1.595626 */ + .quad 0x3FF9CA0A8337B317 /* HI((2^2*(1+1/32+1/64))^(1/3)) = 1.611826 */ + .quad 0x3FFA0B1709CC13D5 /* HI((2^2*(1+2/32+1/64))^(1/3)) = 1.627708 */ + .quad 0x3FFA4AE4CE6419ED /* HI((2^2*(1+3/32+1/64))^(1/3)) = 1.643285 */ + .quad 0x3FFA8982A5567031 /* HI((2^2*(1+4/32+1/64))^(1/3)) = 1.658572 */ + .quad 0x3FFAC6FE500AB570 /* HI((2^2*(1+5/32+1/64))^(1/3)) = 1.673582 */ + .quad 0x3FFB036497A15A17 /* 
HI((2^2*(1+6/32+1/64))^(1/3)) = 1.688328 */ + .quad 0x3FFB3EC164671755 /* HI((2^2*(1+7/32+1/64))^(1/3)) = 1.702821 */ + .quad 0x3FFB791FD288C46F /* HI((2^2*(1+8/32+1/64))^(1/3)) = 1.717071 */ + .quad 0x3FFBB28A44693BE4 /* HI((2^2*(1+9/32+1/64))^(1/3)) = 1.731089 */ + .quad 0x3FFBEB0A72EB6E31 /* HI((2^2*(1+10/32+1/64))^(1/3)) = 1.744883 */ + .quad 0x3FFC22A97BF5F697 /* HI((2^2*(1+11/32+1/64))^(1/3)) = 1.758462 */ + .quad 0x3FFC596FEF6AF983 /* HI((2^2*(1+12/32+1/64))^(1/3)) = 1.771835 */ + .quad 0x3FFC8F65DAC655A3 /* HI((2^2*(1+13/32+1/64))^(1/3)) = 1.785009 */ + .quad 0x3FFCC492D38CE8D9 /* HI((2^2*(1+14/32+1/64))^(1/3)) = 1.797992 */ + .quad 0x3FFCF8FE00B19367 /* HI((2^2*(1+15/32+1/64))^(1/3)) = 1.810789 */ + .quad 0x3FFD2CAE230F8709 /* HI((2^2*(1+16/32+1/64))^(1/3)) = 1.823408 */ + .quad 0x3FFD5FA99D15208F /* HI((2^2*(1+17/32+1/64))^(1/3)) = 1.835855 */ + .quad 0x3FFD91F679B6E505 /* HI((2^2*(1+18/32+1/64))^(1/3)) = 1.848135 */ + .quad 0x3FFDC39A72BF2302 /* HI((2^2*(1+19/32+1/64))^(1/3)) = 1.860255 */ + .quad 0x3FFDF49AF68C1570 /* HI((2^2*(1+20/32+1/64))^(1/3)) = 1.872218 */ + .quad 0x3FFE24FD2D4C23B8 /* HI((2^2*(1+21/32+1/64))^(1/3)) = 1.884031 */ + .quad 0x3FFE54C5FDC5EC73 /* HI((2^2*(1+22/32+1/64))^(1/3)) = 1.895697 */ + .quad 0x3FFE83FA11B81DBB /* HI((2^2*(1+23/32+1/64))^(1/3)) = 1.907221 */ + .quad 0x3FFEB29DD9DBAF25 /* HI((2^2*(1+24/32+1/64))^(1/3)) = 1.918608 */ + .quad 0x3FFEE0B59191D374 /* HI((2^2*(1+25/32+1/64))^(1/3)) = 1.929861 */ + .quad 0x3FFF0E454245E4BF /* HI((2^2*(1+26/32+1/64))^(1/3)) = 1.940984 */ + .quad 0x3FFF3B50C68A9DD3 /* HI((2^2*(1+27/32+1/64))^(1/3)) = 1.951981 */ + .quad 0x3FFF67DBCCF922DC /* HI((2^2*(1+28/32+1/64))^(1/3)) = 1.962856 */ + .quad 0x3FFF93E9DAD7A4A6 /* HI((2^2*(1+29/32+1/64))^(1/3)) = 1.973612 */ + .quad 0x3FFFBF7E4E8CC9CB /* HI((2^2*(1+30/32+1/64))^(1/3)) = 1.984251 */ + .quad 0x3FFFEA9C61E47CD3 /* HI((2^2*(1+31/32+1/64))^(1/3)) = 1.994778 */ + .align 16 + .quad 0x3F93750AD588F115, 0x3F93750AD588F115 /* _dA7 */ + .align 16 + 
.quad 0xBF98090D6221A247, 0xBF98090D6221A247 /* _dA6 */ + .align 16 + .quad 0x3F9EE7113506AC12, 0x3F9EE7113506AC12 /* _dA5 */ + .align 16 + .quad 0xBFA511E8D2B3183B, 0xBFA511E8D2B3183B /* _dA4 */ + .align 16 + .quad 0x3FAF9ADD3C0CA458, 0x3FAF9ADD3C0CA458 /* _dA3 */ + .align 16 + .quad 0xBFBC71C71C71C71C, 0xBFBC71C71C71C71C /* _dA2 */ + .align 16 + .quad 0x3FD5555555555555, 0x3FD5555555555555 /* _dA1 */ + .align 16 + .quad 0xBFF0400000000000, 0xBFF0400000000000 /* _dNeg65Div64 */ + .align 16 + .quad 0x000FC00000000000, 0x000FC00000000000 /* _dSgnf6Mask */ + .align 16 + .quad 0xBFF0000000000000, 0xBFF0000000000000 /* _dNegOne */ + .align 16 + .quad 0x000FFFFFFFFFFFFF, 0x000FFFFFFFFFFFFF /* _dMantissaMask */ + .align 16 + .quad 0xFFF0000000000000, 0xFFF0000000000000 /* _lExpHiMask */ + .align 16 + .quad 0x00000000000007FF, 0x00000000000007FF /* _lExpLoMask */ + .align 16 + .quad 0x0000000000001556, 0x0000000000001556 /* _l1556 */ + .align 16 + .long 0x000F8000, 0x000F8000, 0x000F8000, 0x000F8000 /* _iRcpIndexMask */ + .align 16 + .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF /* _iAbsMask */ + .align 16 + .long 0x00000800, 0x00000800, 0x00000800, 0x00000800 /* _iSignMask */ + .align 16 + .long 0x000002AA, 0x000002AA, 0x000002AA, 0x000002AA /* _iBias */ + .align 16 + .long 0x80100000, 0x80100000, 0x80100000, 0x80100000 /* _iSub */ + .align 16 + .long 0xffdfffff, 0xffdfffff, 0xffdfffff, 0xffdfffff /* _iCmp */ + .align 16 + .type __svml_dcbrt_data_internal,@object + .size __svml_dcbrt_data_internal,.-__svml_dcbrt_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core-sse.S new file mode 100644 index 0000000000..3b54f31fbc --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized cbrt, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVdN4v_cbrt _ZGVdN4v_cbrt_sse_wrapper +#include "../svml_d_cbrt4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core.c new file mode 100644 index 0000000000..0b135877aa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized cbrt, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVdN4v_cbrt +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_cbrt, __GI__ZGVdN4v_cbrt, __redirect__ZGVdN4v_cbrt) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core_avx2.S new file mode 100644 index 0000000000..2223c5309f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt4_core_avx2.S @@ -0,0 +1,505 @@ +/* Function cbrt vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * x=2^{3*k+j} * 1.b1 b2 ... b5 b6 ... b52 + * Let r=(x*2^{-3k-j} - 1.b1 b2 ... b5 1)* rcp[b1 b2 ..b5], + * where rcp[b1 b2 .. b5]=1/(1.b1 b2 b3 b4 b5 1) in double precision + * cbrt(2^j * 1. b1 b2 .. 
b5 1) is approximated as T[j][b1..b5]+D[j][b1..b5] + * (T stores the high 53 bits, D stores the low order bits) + * Result=2^k*T+(2^k*T*r)*P+2^k*D + * where P=p1+p2*r+..+p8*r^7 + * + */ + +/* Offsets for data table __svml_dcbrt_data_internal + */ +#define _dRcp 0 +#define _dCbrtHiLo 256 +#define _dA7 1024 +#define _dA6 1056 +#define _dA5 1088 +#define _dA4 1120 +#define _dA3 1152 +#define _dA2 1184 +#define _dA1 1216 +#define _dNeg65Div64 1248 +#define _dSgnf6Mask 1280 +#define _dNegOne 1312 +#define _dMantissaMask 1344 +#define _lExpHiMask 1376 +#define _lExpLoMask 1408 +#define _l1556 1440 +#define _iRcpIndexMask 1472 +#define _iAbsMask 1504 +#define _iSignMask 1536 +#define _iBias 1568 +#define _iSub 1600 +#define _iCmp 1632 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_cbrt_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* Load 1/(1+iRcpIndex/32+1/64) reciprocal table value */ + lea __svml_dcbrt_data_internal(%rip), %rax + vmovapd %ymm0, %ymm5 + +/* + * Declarations + * Load constants + * Get iX - high part of argument + */ + vextractf128 $1, %ymm5, %xmm6 + +/* Calculate CbrtIndex */ + vpsrlq $52, %ymm5, %ymm15 + vshufps $221, %xmm6, %xmm5, %xmm4 + +/* Calculate Rcp table index */ + vandps _iRcpIndexMask+__svml_dcbrt_data_internal(%rip), %xmm4, %xmm10 + vpsrld $12, %xmm10, %xmm3 + vmovd %xmm3, %ecx + +/* If the exponent field is zero - go to callout to process denormals */ + vandps _iAbsMask+__svml_dcbrt_data_internal(%rip), %xmm4, %xmm7 + +/* Compute 2^k */ + vpsrld $20, %xmm4, %xmm4 + vpsubd _iSub+__svml_dcbrt_data_internal(%rip), %xmm7, %xmm8 + vandps _lExpLoMask+__svml_dcbrt_data_internal(%rip), %ymm15, %ymm0 + vpmuludq _l1556+__svml_dcbrt_data_internal(%rip), %ymm0, %ymm6 + vpextrd $2, %xmm3, %edi + movslq %ecx, %rcx + vpextrd $1, %xmm3, %esi + movslq %edi, %rdi + vpextrd $3, %xmm3, %r8d + movslq %esi, %rsi + movslq %r8d, %r8 + 
vpcmpgtd _iCmp+__svml_dcbrt_data_internal(%rip), %xmm8, %xmm9 + vmovsd (%rax,%rcx), %xmm11 + vmovmskps %xmm9, %edx + vmovsd (%rax,%rdi), %xmm13 + vmovhpd (%rax,%rsi), %xmm11, %xmm12 + vmovhpd (%rax,%r8), %xmm13, %xmm14 + vextractf128 $1, %ymm6, %xmm7 + vshufps $136, %xmm7, %xmm6, %xmm8 + vmovups __VUNPACK_ODD_ind1.613.0.1(%rip), %ymm7 + vextractf128 $1, %ymm0, %xmm1 + vshufps $136, %xmm1, %xmm0, %xmm9 + vpsrld $14, %xmm8, %xmm1 + vpsubd %xmm1, %xmm9, %xmm10 + vpaddd %xmm1, %xmm1, %xmm11 + +/* + * VAND( L, l2k, = l2k, lExpHiMask ); + * Argument reduction Z + */ + vandpd _dMantissaMask+__svml_dcbrt_data_internal(%rip), %ymm5, %ymm9 + vinsertf128 $1, %xmm14, %ymm12, %ymm2 + vpsubd %xmm11, %xmm10, %xmm12 + vpslld $8, %xmm12, %xmm13 + vpaddd %xmm13, %xmm3, %xmm15 + +/* Load cbrt(2^j*(1+iRcpIndex/32+1/64)) Hi & Lo values */ + vmovd %xmm15, %r9d + vpextrd $2, %xmm15, %r11d + movslq %r9d, %r9 + vpextrd $1, %xmm15, %r10d + movslq %r11d, %r11 + vpextrd $3, %xmm15, %ecx + movslq %r10d, %r10 + movslq %ecx, %rcx + vmovsd 256(%rax,%r9), %xmm3 + vmovsd 256(%rax,%r11), %xmm0 + vandpd _dSgnf6Mask+__svml_dcbrt_data_internal(%rip), %ymm5, %ymm10 + vmovhpd 256(%rax,%r10), %xmm3, %xmm14 + vmovhpd 256(%rax,%rcx), %xmm0, %xmm3 + vorpd _dNegOne+__svml_dcbrt_data_internal(%rip), %ymm9, %ymm11 + vorpd _dNeg65Div64+__svml_dcbrt_data_internal(%rip), %ymm10, %ymm12 + vsubpd %ymm12, %ymm11, %ymm13 + vmulpd %ymm13, %ymm2, %ymm2 + vinsertf128 $1, %xmm3, %ymm14, %ymm0 + vpand _iSignMask+__svml_dcbrt_data_internal(%rip), %xmm4, %xmm3 + vpor _iBias+__svml_dcbrt_data_internal(%rip), %xmm3, %xmm4 + vpaddd %xmm1, %xmm4, %xmm1 + vpslld $20, %xmm1, %xmm6 + +/* Polynomial */ + vmovupd _dA7+__svml_dcbrt_data_internal(%rip), %ymm1 + vfmadd213pd _dA6+__svml_dcbrt_data_internal(%rip), %ymm2, %ymm1 + vfmadd213pd _dA5+__svml_dcbrt_data_internal(%rip), %ymm2, %ymm1 + vfmadd213pd _dA4+__svml_dcbrt_data_internal(%rip), %ymm2, %ymm1 + vfmadd213pd _dA3+__svml_dcbrt_data_internal(%rip), %ymm2, %ymm1 + vfmadd213pd 
_dA2+__svml_dcbrt_data_internal(%rip), %ymm2, %ymm1 + vfmadd213pd _dA1+__svml_dcbrt_data_internal(%rip), %ymm2, %ymm1 + vpermps %ymm6, %ymm7, %ymm8 + vandps __VUNPACK_ODD_mask.613.0.1(%rip), %ymm8, %ymm14 + +/* THi*2^k, TLo*2^k */ + vmulpd %ymm14, %ymm0, %ymm0 + +/* THi*2^k*Z */ + vmulpd %ymm0, %ymm2, %ymm2 + +/* Final reconstruction */ + vmulpd %ymm2, %ymm1, %ymm3 + vaddpd %ymm3, %ymm0, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm5 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm5, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * 
processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call cbrt@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_cbrt_avx2) + .section .rodata, "a" + .align 32 + +__VUNPACK_ODD_ind1.613.0.1: + .rept 3 + .long 0 + .endr + .long 1 + .long 0 + .long 2 + .long 0 + .long 3 + .align 32 + +__VUNPACK_ODD_mask.613.0.1: + .long 0 + .long -1 + .long 0 + .long -1 + .long 0 + .long -1 + .long 0 + .long -1 + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dcbrt_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _dRcp[32][2]; + __declspec(align(32)) VUINT32 _dCbrtHiLo[96][2]; + __declspec(align(32)) VUINT32 _dA7[4][2]; + 
__declspec(align(32)) VUINT32 _dA6[4][2]; + __declspec(align(32)) VUINT32 _dA5[4][2]; + __declspec(align(32)) VUINT32 _dA4[4][2]; + __declspec(align(32)) VUINT32 _dA3[4][2]; + __declspec(align(32)) VUINT32 _dA2[4][2]; + __declspec(align(32)) VUINT32 _dA1[4][2]; + __declspec(align(32)) VUINT32 _dNeg65Div64[4][2]; + __declspec(align(32)) VUINT32 _dSgnf6Mask[4][2]; + __declspec(align(32)) VUINT32 _dNegOne[4][2]; + __declspec(align(32)) VUINT32 _dMantissaMask[4][2]; + __declspec(align(32)) VUINT32 _lExpHiMask[4][2]; + __declspec(align(32)) VUINT32 _lExpLoMask[4][2]; + __declspec(align(32)) VUINT32 _l1556[4][2]; + __declspec(align(32)) VUINT32 _iRcpIndexMask[8][1]; + __declspec(align(32)) VUINT32 _iAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iSignMask[8][1]; + __declspec(align(32)) VUINT32 _iBias[8][1]; + __declspec(align(32)) VUINT32 _iSub[8][1]; + __declspec(align(32)) VUINT32 _iCmp[8][1]; +} __svml_dcbrt_data_internal; +#endif +__svml_dcbrt_data_internal: + /*== _dRcp ==*/ + .quad 0xBFEF81F81F81F820 /* (1/(1+0/32+1/64)) = -.984615 */ + .quad 0xBFEE9131ABF0B767 /* (1/(1+1/32+1/64)) = -.955224 */ + .quad 0xBFEDAE6076B981DB /* (1/(1+2/32+1/64)) = -.927536 */ + .quad 0xBFECD85689039B0B /* (1/(1+3/32+1/64)) = -.901408 */ + .quad 0xBFEC0E070381C0E0 /* (1/(1+4/32+1/64)) = -.876712 */ + .quad 0xBFEB4E81B4E81B4F /* (1/(1+5/32+1/64)) = -.853333 */ + .quad 0xBFEA98EF606A63BE /* (1/(1+6/32+1/64)) = -.831169 */ + .quad 0xBFE9EC8E951033D9 /* (1/(1+7/32+1/64)) = -.810127 */ + .quad 0xBFE948B0FCD6E9E0 /* (1/(1+8/32+1/64)) = -.790123 */ + .quad 0xBFE8ACB90F6BF3AA /* (1/(1+9/32+1/64)) = -.771084 */ + .quad 0xBFE8181818181818 /* (1/(1+10/32+1/64)) = -.752941 */ + .quad 0xBFE78A4C8178A4C8 /* (1/(1+11/32+1/64)) = -.735632 */ + .quad 0xBFE702E05C0B8170 /* (1/(1+12/32+1/64)) = -.719101 */ + .quad 0xBFE6816816816817 /* (1/(1+13/32+1/64)) = -.703297 */ + .quad 0xBFE6058160581606 /* (1/(1+14/32+1/64)) = -.688172 */ + .quad 0xBFE58ED2308158ED /* (1/(1+15/32+1/64)) = -.673684 */ + .quad 
0xBFE51D07EAE2F815 /* (1/(1+16/32+1/64)) = -.659794 */ + .quad 0xBFE4AFD6A052BF5B /* (1/(1+17/32+1/64)) = -.646465 */ + .quad 0xBFE446F86562D9FB /* (1/(1+18/32+1/64)) = -.633663 */ + .quad 0xBFE3E22CBCE4A902 /* (1/(1+19/32+1/64)) = -.621359 */ + .quad 0xBFE3813813813814 /* (1/(1+20/32+1/64)) = -.609524 */ + .quad 0xBFE323E34A2B10BF /* (1/(1+21/32+1/64)) = -.598131 */ + .quad 0xBFE2C9FB4D812CA0 /* (1/(1+22/32+1/64)) = -.587156 */ + .quad 0xBFE27350B8812735 /* (1/(1+23/32+1/64)) = -.576577 */ + .quad 0xBFE21FB78121FB78 /* (1/(1+24/32+1/64)) = -.566372 */ + .quad 0xBFE1CF06ADA2811D /* (1/(1+25/32+1/64)) = -.556522 */ + .quad 0xBFE1811811811812 /* (1/(1+26/32+1/64)) = -.547009 */ + .quad 0xBFE135C81135C811 /* (1/(1+27/32+1/64)) = -.537815 */ + .quad 0xBFE0ECF56BE69C90 /* (1/(1+28/32+1/64)) = -.528926 */ + .quad 0xBFE0A6810A6810A7 /* (1/(1+29/32+1/64)) = -.520325 */ + .quad 0xBFE0624DD2F1A9FC /* (1/(1+30/32+1/64)) = -.512 */ + .quad 0xBFE0204081020408 /* (1/(1+31/32+1/64)) = -.503937 */ + /*== _dCbrtHiLo ==*/ + .align 32 + .quad 0x3FF01539221D4C97 /* HI((2^0*(1+0/32+1/64))^(1/3)) = 1.005181 */ + .quad 0x3FF03F06771A2E33 /* HI((2^0*(1+1/32+1/64))^(1/3)) = 1.015387 */ + .quad 0x3FF06800E629D671 /* HI((2^0*(1+2/32+1/64))^(1/3)) = 1.025391 */ + .quad 0x3FF090328731DEB2 /* HI((2^0*(1+3/32+1/64))^(1/3)) = 1.035204 */ + .quad 0x3FF0B7A4B1BD64AC /* HI((2^0*(1+4/32+1/64))^(1/3)) = 1.044835 */ + .quad 0x3FF0DE601024FB87 /* HI((2^0*(1+5/32+1/64))^(1/3)) = 1.054291 */ + .quad 0x3FF1046CB0597000 /* HI((2^0*(1+6/32+1/64))^(1/3)) = 1.06358 */ + .quad 0x3FF129D212A9BA9B /* HI((2^0*(1+7/32+1/64))^(1/3)) = 1.07271 */ + .quad 0x3FF14E9736CDAF38 /* HI((2^0*(1+8/32+1/64))^(1/3)) = 1.081687 */ + .quad 0x3FF172C2A772F507 /* HI((2^0*(1+9/32+1/64))^(1/3)) = 1.090518 */ + .quad 0x3FF1965A848001D3 /* HI((2^0*(1+10/32+1/64))^(1/3)) = 1.099207 */ + .quad 0x3FF1B9648C38C55D /* HI((2^0*(1+11/32+1/64))^(1/3)) = 1.107762 */ + .quad 0x3FF1DBE6236A0C45 /* HI((2^0*(1+12/32+1/64))^(1/3)) = 1.116186 */ + 
.quad 0x3FF1FDE45CBB1F9F /* HI((2^0*(1+13/32+1/64))^(1/3)) = 1.124485 */ + .quad 0x3FF21F63FF409042 /* HI((2^0*(1+14/32+1/64))^(1/3)) = 1.132664 */ + .quad 0x3FF240698C6746E5 /* HI((2^0*(1+15/32+1/64))^(1/3)) = 1.140726 */ + .quad 0x3FF260F9454BB99B /* HI((2^0*(1+16/32+1/64))^(1/3)) = 1.148675 */ + .quad 0x3FF281172F8E7073 /* HI((2^0*(1+17/32+1/64))^(1/3)) = 1.156516 */ + .quad 0x3FF2A0C719B4B6D0 /* HI((2^0*(1+18/32+1/64))^(1/3)) = 1.164252 */ + .quad 0x3FF2C00C9F2263EC /* HI((2^0*(1+19/32+1/64))^(1/3)) = 1.171887 */ + .quad 0x3FF2DEEB2BB7FB78 /* HI((2^0*(1+20/32+1/64))^(1/3)) = 1.179423 */ + .quad 0x3FF2FD65FF1EFBBC /* HI((2^0*(1+21/32+1/64))^(1/3)) = 1.186865 */ + .quad 0x3FF31B802FCCF6A2 /* HI((2^0*(1+22/32+1/64))^(1/3)) = 1.194214 */ + .quad 0x3FF3393CADC50708 /* HI((2^0*(1+23/32+1/64))^(1/3)) = 1.201474 */ + .quad 0x3FF3569E451E4C2A /* HI((2^0*(1+24/32+1/64))^(1/3)) = 1.208647 */ + .quad 0x3FF373A7A0554CDE /* HI((2^0*(1+25/32+1/64))^(1/3)) = 1.215736 */ + .quad 0x3FF3905B4A6D76CE /* HI((2^0*(1+26/32+1/64))^(1/3)) = 1.222743 */ + .quad 0x3FF3ACBBB0E756B6 /* HI((2^0*(1+27/32+1/64))^(1/3)) = 1.229671 */ + .quad 0x3FF3C8CB258FA340 /* HI((2^0*(1+28/32+1/64))^(1/3)) = 1.236522 */ + .quad 0x3FF3E48BE02AC0CE /* HI((2^0*(1+29/32+1/64))^(1/3)) = 1.243297 */ + .quad 0x3FF4000000000000 /* HI((2^0*(1+30/32+1/64))^(1/3)) = 1.25 */ + .quad 0x3FF41B298D47800E /* HI((2^0*(1+31/32+1/64))^(1/3)) = 1.256631 */ + .quad 0x3FF443604B34D9B2 /* HI((2^1*(1+0/32+1/64))^(1/3)) = 1.266449 */ + .quad 0x3FF4780B20906571 /* HI((2^1*(1+1/32+1/64))^(1/3)) = 1.279307 */ + .quad 0x3FF4ABAC3EE06706 /* HI((2^1*(1+2/32+1/64))^(1/3)) = 1.291912 */ + .quad 0x3FF4DE505DA66B8D /* HI((2^1*(1+3/32+1/64))^(1/3)) = 1.304276 */ + .quad 0x3FF51003420A5C07 /* HI((2^1*(1+4/32+1/64))^(1/3)) = 1.316409 */ + .quad 0x3FF540CFD6FD11C1 /* HI((2^1*(1+5/32+1/64))^(1/3)) = 1.328323 */ + .quad 0x3FF570C04260716B /* HI((2^1*(1+6/32+1/64))^(1/3)) = 1.340027 */ + .quad 0x3FF59FDDF7A45F38 /* HI((2^1*(1+7/32+1/64))^(1/3)) = 
1.35153 */ + .quad 0x3FF5CE31C83539DF /* HI((2^1*(1+8/32+1/64))^(1/3)) = 1.36284 */ + .quad 0x3FF5FBC3F20966A4 /* HI((2^1*(1+9/32+1/64))^(1/3)) = 1.373966 */ + .quad 0x3FF6289C2C8F1B70 /* HI((2^1*(1+10/32+1/64))^(1/3)) = 1.384915 */ + .quad 0x3FF654C1B4316DCF /* HI((2^1*(1+11/32+1/64))^(1/3)) = 1.395693 */ + .quad 0x3FF6803B54A34E44 /* HI((2^1*(1+12/32+1/64))^(1/3)) = 1.406307 */ + .quad 0x3FF6AB0F72182659 /* HI((2^1*(1+13/32+1/64))^(1/3)) = 1.416763 */ + .quad 0x3FF6D544118C08BC /* HI((2^1*(1+14/32+1/64))^(1/3)) = 1.427067 */ + .quad 0x3FF6FEDEE0388D4A /* HI((2^1*(1+15/32+1/64))^(1/3)) = 1.437224 */ + .quad 0x3FF727E53A4F645E /* HI((2^1*(1+16/32+1/64))^(1/3)) = 1.44724 */ + .quad 0x3FF7505C31104114 /* HI((2^1*(1+17/32+1/64))^(1/3)) = 1.457119 */ + .quad 0x3FF77848904CD549 /* HI((2^1*(1+18/32+1/64))^(1/3)) = 1.466866 */ + .quad 0x3FF79FAEE36B2534 /* HI((2^1*(1+19/32+1/64))^(1/3)) = 1.476485 */ + .quad 0x3FF7C69379F4605B /* HI((2^1*(1+20/32+1/64))^(1/3)) = 1.48598 */ + .quad 0x3FF7ECFA6BBCA391 /* HI((2^1*(1+21/32+1/64))^(1/3)) = 1.495356 */ + .quad 0x3FF812E79CAE7EB9 /* HI((2^1*(1+22/32+1/64))^(1/3)) = 1.504615 */ + .quad 0x3FF8385EC043C71D /* HI((2^1*(1+23/32+1/64))^(1/3)) = 1.513762 */ + .quad 0x3FF85D635CB41B9D /* HI((2^1*(1+24/32+1/64))^(1/3)) = 1.5228 */ + .quad 0x3FF881F8CDE083DB /* HI((2^1*(1+25/32+1/64))^(1/3)) = 1.531731 */ + .quad 0x3FF8A6224802B8A8 /* HI((2^1*(1+26/32+1/64))^(1/3)) = 1.54056 */ + .quad 0x3FF8C9E2DA25E5E4 /* HI((2^1*(1+27/32+1/64))^(1/3)) = 1.549289 */ + .quad 0x3FF8ED3D706E1010 /* HI((2^1*(1+28/32+1/64))^(1/3)) = 1.55792 */ + .quad 0x3FF91034D632B6DF /* HI((2^1*(1+29/32+1/64))^(1/3)) = 1.566457 */ + .quad 0x3FF932CBB7F0CF2D /* HI((2^1*(1+30/32+1/64))^(1/3)) = 1.574901 */ + .quad 0x3FF95504A517BF3A /* HI((2^1*(1+31/32+1/64))^(1/3)) = 1.583256 */ + .quad 0x3FF987AF34F8BB19 /* HI((2^2*(1+0/32+1/64))^(1/3)) = 1.595626 */ + .quad 0x3FF9CA0A8337B317 /* HI((2^2*(1+1/32+1/64))^(1/3)) = 1.611826 */ + .quad 0x3FFA0B1709CC13D5 /* 
HI((2^2*(1+2/32+1/64))^(1/3)) = 1.627708 */ + .quad 0x3FFA4AE4CE6419ED /* HI((2^2*(1+3/32+1/64))^(1/3)) = 1.643285 */ + .quad 0x3FFA8982A5567031 /* HI((2^2*(1+4/32+1/64))^(1/3)) = 1.658572 */ + .quad 0x3FFAC6FE500AB570 /* HI((2^2*(1+5/32+1/64))^(1/3)) = 1.673582 */ + .quad 0x3FFB036497A15A17 /* HI((2^2*(1+6/32+1/64))^(1/3)) = 1.688328 */ + .quad 0x3FFB3EC164671755 /* HI((2^2*(1+7/32+1/64))^(1/3)) = 1.702821 */ + .quad 0x3FFB791FD288C46F /* HI((2^2*(1+8/32+1/64))^(1/3)) = 1.717071 */ + .quad 0x3FFBB28A44693BE4 /* HI((2^2*(1+9/32+1/64))^(1/3)) = 1.731089 */ + .quad 0x3FFBEB0A72EB6E31 /* HI((2^2*(1+10/32+1/64))^(1/3)) = 1.744883 */ + .quad 0x3FFC22A97BF5F697 /* HI((2^2*(1+11/32+1/64))^(1/3)) = 1.758462 */ + .quad 0x3FFC596FEF6AF983 /* HI((2^2*(1+12/32+1/64))^(1/3)) = 1.771835 */ + .quad 0x3FFC8F65DAC655A3 /* HI((2^2*(1+13/32+1/64))^(1/3)) = 1.785009 */ + .quad 0x3FFCC492D38CE8D9 /* HI((2^2*(1+14/32+1/64))^(1/3)) = 1.797992 */ + .quad 0x3FFCF8FE00B19367 /* HI((2^2*(1+15/32+1/64))^(1/3)) = 1.810789 */ + .quad 0x3FFD2CAE230F8709 /* HI((2^2*(1+16/32+1/64))^(1/3)) = 1.823408 */ + .quad 0x3FFD5FA99D15208F /* HI((2^2*(1+17/32+1/64))^(1/3)) = 1.835855 */ + .quad 0x3FFD91F679B6E505 /* HI((2^2*(1+18/32+1/64))^(1/3)) = 1.848135 */ + .quad 0x3FFDC39A72BF2302 /* HI((2^2*(1+19/32+1/64))^(1/3)) = 1.860255 */ + .quad 0x3FFDF49AF68C1570 /* HI((2^2*(1+20/32+1/64))^(1/3)) = 1.872218 */ + .quad 0x3FFE24FD2D4C23B8 /* HI((2^2*(1+21/32+1/64))^(1/3)) = 1.884031 */ + .quad 0x3FFE54C5FDC5EC73 /* HI((2^2*(1+22/32+1/64))^(1/3)) = 1.895697 */ + .quad 0x3FFE83FA11B81DBB /* HI((2^2*(1+23/32+1/64))^(1/3)) = 1.907221 */ + .quad 0x3FFEB29DD9DBAF25 /* HI((2^2*(1+24/32+1/64))^(1/3)) = 1.918608 */ + .quad 0x3FFEE0B59191D374 /* HI((2^2*(1+25/32+1/64))^(1/3)) = 1.929861 */ + .quad 0x3FFF0E454245E4BF /* HI((2^2*(1+26/32+1/64))^(1/3)) = 1.940984 */ + .quad 0x3FFF3B50C68A9DD3 /* HI((2^2*(1+27/32+1/64))^(1/3)) = 1.951981 */ + .quad 0x3FFF67DBCCF922DC /* HI((2^2*(1+28/32+1/64))^(1/3)) = 1.962856 */ + .quad 
0x3FFF93E9DAD7A4A6 /* HI((2^2*(1+29/32+1/64))^(1/3)) = 1.973612 */ + .quad 0x3FFFBF7E4E8CC9CB /* HI((2^2*(1+30/32+1/64))^(1/3)) = 1.984251 */ + .quad 0x3FFFEA9C61E47CD3 /* HI((2^2*(1+31/32+1/64))^(1/3)) = 1.994778 */ + .align 32 + .quad 0x3F93750AD588F115, 0x3F93750AD588F115, 0x3F93750AD588F115, 0x3F93750AD588F115 /* _dA7 */ + .align 32 + .quad 0xBF98090D6221A247, 0xBF98090D6221A247, 0xBF98090D6221A247, 0xBF98090D6221A247 /* _dA6 */ + .align 32 + .quad 0x3F9EE7113506AC12, 0x3F9EE7113506AC12, 0x3F9EE7113506AC12, 0x3F9EE7113506AC12 /* _dA5 */ + .align 32 + .quad 0xBFA511E8D2B3183B, 0xBFA511E8D2B3183B, 0xBFA511E8D2B3183B, 0xBFA511E8D2B3183B /* _dA4 */ + .align 32 + .quad 0x3FAF9ADD3C0CA458, 0x3FAF9ADD3C0CA458, 0x3FAF9ADD3C0CA458, 0x3FAF9ADD3C0CA458 /* _dA3 */ + .align 32 + .quad 0xBFBC71C71C71C71C, 0xBFBC71C71C71C71C, 0xBFBC71C71C71C71C, 0xBFBC71C71C71C71C /* _dA2 */ + .align 32 + .quad 0x3FD5555555555555, 0x3FD5555555555555, 0x3FD5555555555555, 0x3FD5555555555555 /* _dA1 */ + .align 32 + .quad 0xBFF0400000000000, 0xBFF0400000000000, 0xBFF0400000000000, 0xBFF0400000000000 /* _dNeg65Div64 */ + .align 32 + .quad 0x000FC00000000000, 0x000FC00000000000, 0x000FC00000000000, 0x000FC00000000000 /* _dSgnf6Mask */ + .align 32 + .quad 0xBFF0000000000000, 0xBFF0000000000000, 0xBFF0000000000000, 0xBFF0000000000000 /* _dNegOne */ + .align 32 + .quad 0x000FFFFFFFFFFFFF, 0x000FFFFFFFFFFFFF, 0x000FFFFFFFFFFFFF, 0x000FFFFFFFFFFFFF /* _dMantissaMask */ + .align 32 + .quad 0xFFF0000000000000, 0xFFF0000000000000, 0xFFF0000000000000, 0xFFF0000000000000 /* _lExpHiMask */ + .align 32 + .quad 0x00000000000007FF, 0x00000000000007FF, 0x00000000000007FF, 0x00000000000007FF /* _lExpLoMask */ + .align 32 + .quad 0x0000000000001556, 0x0000000000001556, 0x0000000000001556, 0x0000000000001556 /* _l1556 */ + .align 32 + .long 0x000F8000, 0x000F8000, 0x000F8000, 0x000F8000, 0x000F8000, 0x000F8000, 0x000F8000, 0x000F8000 /* _iRcpIndexMask */ + .align 32 + .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 
0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF /* _iAbsMask */ + .align 32 + .long 0x00000800, 0x00000800, 0x00000800, 0x00000800, 0x00000800, 0x00000800, 0x00000800, 0x00000800 /* _iSignMask */ + .align 32 + .long 0x000002AA, 0x000002AA, 0x000002AA, 0x000002AA, 0x000002AA, 0x000002AA, 0x000002AA, 0x000002AA /* _iBias */ + .align 32 + .long 0x80100000, 0x80100000, 0x80100000, 0x80100000, 0x80100000, 0x80100000, 0x80100000, 0x80100000 /* _iSub */ + .align 32 + .long 0xffdfffff, 0xffdfffff, 0xffdfffff, 0xffdfffff, 0xffdfffff, 0xffdfffff, 0xffdfffff, 0xffdfffff /* _iCmp */ + .align 32 + .type __svml_dcbrt_data_internal,@object + .size __svml_dcbrt_data_internal,.-__svml_dcbrt_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core-avx2.S new file mode 100644 index 0000000000..3831e582ce --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized cbrt, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define _ZGVeN8v_cbrt _ZGVeN8v_cbrt_avx2_wrapper +#include "../svml_d_cbrt8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core.c new file mode 100644 index 0000000000..28c147216f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized cbrt, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVeN8v_cbrt +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_cbrt, __GI__ZGVeN8v_cbrt, __redirect__ZGVeN8v_cbrt) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core_avx512.S new file mode 100644 index 0000000000..b9c071b54c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_cbrt8_core_avx512.S @@ -0,0 +1,253 @@ +/* Function cbrt vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * x=2^{3*k+j} * 1.b1 b2 ... b5 b6 ... b52 + * Let r=(x*2^{-3k-j} - 1.b1 b2 ... b5 1)* rcp[b1 b2 ..b5], + * where rcp[b1 b2 .. b5]=1/(1.b1 b2 b3 b4 b5 1) in double precision + * cbrt(2^j * 1. b1 b2 .. b5 1) is approximated as T[j][b1..b5]+D[j][b1..b5] + * (T stores the high 53 bits, D stores the low order bits) + * Result=2^k*T+(2^k*T*r)*P+2^k*D + * where P=p1+p2*r+..+p8*r^7 + * + */ + +/* Offsets for data table __svml_dcbrt_data_internal_avx512 + */ +#define etbl_H 0 +#define etbl_L 64 +#define cbrt_tbl_H 128 +#define BiasL 256 +#define SZero 320 +#define OneThird 384 +#define Bias3 448 +#define Three 512 +#define One 576 +#define poly_coeff10 640 +#define poly_coeff9 704 +#define poly_coeff8 768 +#define poly_coeff7 832 +#define poly_coeff6 896 +#define poly_coeff5 960 +#define poly_coeff4 1024 +#define poly_coeff3 1088 +#define poly_coeff2 1152 +#define poly_coeff1 1216 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_cbrt_skx) + vgetmantpd $0, {sae}, %zmm0, %zmm14 + +/* GetExp(x) */ + vgetexppd {sae}, %zmm0, %zmm7 + vmovups BiasL+__svml_dcbrt_data_internal_avx512(%rip), %zmm8 + +/* exponent/3 */ + vmovups OneThird+__svml_dcbrt_data_internal_avx512(%rip), %zmm9 + vmovups Bias3+__svml_dcbrt_data_internal_avx512(%rip), %zmm10 + +/* 
Reduced argument: R = DblRcp*Mantissa - 1 */ + vmovups One+__svml_dcbrt_data_internal_avx512(%rip), %zmm2 + +/* exponent%3 (to be used as index) */ + vmovups Three+__svml_dcbrt_data_internal_avx512(%rip), %zmm11 + +/* DblRcp ~ 1/Mantissa */ + vrcp14pd %zmm14, %zmm13 + vaddpd {rn-sae}, %zmm8, %zmm7, %zmm12 + vandpd SZero+__svml_dcbrt_data_internal_avx512(%rip), %zmm0, %zmm6 + +/* round DblRcp to 3 fractional bits (RN mode, no Precision exception) */ + vrndscalepd $72, {sae}, %zmm13, %zmm15 + vfmsub231pd {rn-sae}, %zmm12, %zmm9, %zmm10 + +/* polynomial */ + vmovups poly_coeff10+__svml_dcbrt_data_internal_avx512(%rip), %zmm0 + vmovups poly_coeff8+__svml_dcbrt_data_internal_avx512(%rip), %zmm7 + vmovups poly_coeff7+__svml_dcbrt_data_internal_avx512(%rip), %zmm9 + vfmsub231pd {rn-sae}, %zmm15, %zmm14, %zmm2 + vrndscalepd $9, {sae}, %zmm10, %zmm5 + +/* Table lookup */ + vmovups cbrt_tbl_H+__svml_dcbrt_data_internal_avx512(%rip), %zmm10 + vmovups poly_coeff6+__svml_dcbrt_data_internal_avx512(%rip), %zmm8 + vmovups poly_coeff3+__svml_dcbrt_data_internal_avx512(%rip), %zmm13 + vfmadd231pd {rn-sae}, %zmm2, %zmm7, %zmm9 + vfnmadd231pd {rn-sae}, %zmm5, %zmm11, %zmm12 + vmovups poly_coeff5+__svml_dcbrt_data_internal_avx512(%rip), %zmm11 + vmovups poly_coeff1+__svml_dcbrt_data_internal_avx512(%rip), %zmm14 + +/* Prepare table index */ + vpsrlq $49, %zmm15, %zmm1 + +/* Table lookup: 2^(exponent%3) */ + vpermpd __svml_dcbrt_data_internal_avx512(%rip), %zmm12, %zmm4 + vpermpd etbl_L+__svml_dcbrt_data_internal_avx512(%rip), %zmm12, %zmm3 + vpermt2pd cbrt_tbl_H+64+__svml_dcbrt_data_internal_avx512(%rip), %zmm1, %zmm10 + vmovups poly_coeff9+__svml_dcbrt_data_internal_avx512(%rip), %zmm1 + vfmadd231pd {rn-sae}, %zmm2, %zmm8, %zmm11 + vmovups poly_coeff2+__svml_dcbrt_data_internal_avx512(%rip), %zmm12 + vscalefpd {rn-sae}, %zmm5, %zmm10, %zmm15 + vfmadd231pd {rn-sae}, %zmm2, %zmm0, %zmm1 + vmovups poly_coeff4+__svml_dcbrt_data_internal_avx512(%rip), %zmm5 + vfmadd231pd {rn-sae}, %zmm2, 
%zmm12, %zmm14 + vmulpd {rn-sae}, %zmm2, %zmm2, %zmm0 + vfmadd231pd {rn-sae}, %zmm2, %zmm5, %zmm13 + +/* Sh*R */ + vmulpd {rn-sae}, %zmm2, %zmm4, %zmm2 + vfmadd213pd {rn-sae}, %zmm9, %zmm0, %zmm1 + vfmadd213pd {rn-sae}, %zmm11, %zmm0, %zmm1 + vfmadd213pd {rn-sae}, %zmm13, %zmm0, %zmm1 + vfmadd213pd {rn-sae}, %zmm14, %zmm0, %zmm1 + +/* Sl + (Sh*R)*Poly */ + vfmadd213pd {rn-sae}, %zmm3, %zmm1, %zmm2 + +/* + * branch-free + * scaled_Th*(Sh+Sl+Sh*R*Poly) + */ + vaddpd {rn-sae}, %zmm4, %zmm2, %zmm3 + vmulpd {rn-sae}, %zmm15, %zmm3, %zmm4 + vorpd %zmm6, %zmm4, %zmm0 + ret + +END(_ZGVeN8v_cbrt_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dcbrt_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 etbl_H[8][2]; + __declspec(align(64)) VUINT32 etbl_L[8][2]; + __declspec(align(64)) VUINT32 cbrt_tbl_H[16][2]; + __declspec(align(64)) VUINT32 BiasL[8][2]; + __declspec(align(64)) VUINT32 SZero[8][2]; + __declspec(align(64)) VUINT32 OneThird[8][2]; + __declspec(align(64)) VUINT32 Bias3[8][2]; + __declspec(align(64)) VUINT32 Three[8][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 poly_coeff10[8][2]; + __declspec(align(64)) VUINT32 poly_coeff9[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + } __svml_dcbrt_data_internal_avx512; +#endif +__svml_dcbrt_data_internal_avx512: + /*== etbl_H ==*/ + .quad 0x3ff0000000000000 + .quad 0x3ff428a2f98d728b + .quad 0x3ff965fea53d6e3d + .quad 0x0000000000000000 + .quad 0xbff0000000000000 + .quad 0xbff428a2f98d728b + .quad 0xbff965fea53d6e3d + .quad 0x0000000000000000 + /*== etbl_L ==*/ 
+ .align 64 + .quad 0x0000000000000000 + .quad 0xbc7ddc22548ea41e + .quad 0xbc9f53e999952f09 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x3c7ddc22548ea41e + .quad 0x3c9f53e999952f09 + .quad 0x0000000000000000 + /*== cbrt_tbl_H ==*/ + .align 64 + .quad 0x3ff428a2f98d728b + .quad 0x3ff361f35ca116ff + .quad 0x3ff2b6b5edf6b54a + .quad 0x3ff220e6dd675180 + .quad 0x3ff19c3b38e975a8 + .quad 0x3ff12589c21fb842 + .quad 0x3ff0ba6ee5f9aad4 + .quad 0x3ff059123d3a9848 + .quad 0x3ff0000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + /*== BiasL ==*/ + .align 64 + .quad 0x4338000000000000, 0x4338000000000000, 0x4338000000000000, 0x4338000000000000, 0x4338000000000000, 0x4338000000000000, 0x4338000000000000, 0x4338000000000000 + /*== Zero ==*/ + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 + /*== OneThird ==*/ + .align 64 + .quad 0x3fd5555555555556, 0x3fd5555555555556, 0x3fd5555555555556, 0x3fd5555555555556, 0x3fd5555555555556, 0x3fd5555555555556, 0x3fd5555555555556, 0x3fd5555555555556 + /*== Bias3 ==*/ + .align 64 + .quad 0x4320000000000000, 0x4320000000000000, 0x4320000000000000, 0x4320000000000000, 0x4320000000000000, 0x4320000000000000, 0x4320000000000000, 0x4320000000000000 + /*== Three ==*/ + .align 64 + .quad 0x4008000000000000, 0x4008000000000000, 0x4008000000000000, 0x4008000000000000, 0x4008000000000000, 0x4008000000000000, 0x4008000000000000, 0x4008000000000000 + /*==One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== poly_coeff10 ==*/ + .align 64 + .quad 0xbf882e3b6adeca62, 0xbf882e3b6adeca62, 0xbf882e3b6adeca62, 
0xbf882e3b6adeca62, 0xbf882e3b6adeca62, 0xbf882e3b6adeca62, 0xbf882e3b6adeca62, 0xbf882e3b6adeca62 + /*== poly_coeff9 ==*/ + .align 64 + .quad 0x3f8bda24bae48875, 0x3f8bda24bae48875, 0x3f8bda24bae48875, 0x3f8bda24bae48875, 0x3f8bda24bae48875, 0x3f8bda24bae48875, 0x3f8bda24bae48875, 0x3f8bda24bae48875 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0xbf9036b87c71d55f, 0xbf9036b87c71d55f, 0xbf9036b87c71d55f, 0xbf9036b87c71d55f, 0xbf9036b87c71d55f, 0xbf9036b87c71d55f, 0xbf9036b87c71d55f, 0xbf9036b87c71d55f + /*== poly_coeff7 ==*/ + .align 64 + .quad 0x3f9374ed9398b914, 0x3f9374ed9398b914, 0x3f9374ed9398b914, 0x3f9374ed9398b914, 0x3f9374ed9398b914, 0x3f9374ed9398b914, 0x3f9374ed9398b914, 0x3f9374ed9398b914 + /*== poly_coeff6 ==*/ + .align 64 + .quad 0xbf98090d77f2468e, 0xbf98090d77f2468e, 0xbf98090d77f2468e, 0xbf98090d77f2468e, 0xbf98090d77f2468e, 0xbf98090d77f2468e, 0xbf98090d77f2468e, 0xbf98090d77f2468e + /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3f9ee71141dcf569, 0x3f9ee71141dcf569, 0x3f9ee71141dcf569, 0x3f9ee71141dcf569, 0x3f9ee71141dcf569, 0x3f9ee71141dcf569, 0x3f9ee71141dcf569, 0x3f9ee71141dcf569 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e, 0xbfa511e8d2b0363e + /*== poly_coeff3 ==*/ + .align 64 + .quad 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31, 0x3faf9add3c0b7e31 + /*== poly_coeff2 ==*/ + .align 64 + .quad 0xbfbc71c71c71c741, 0xbfbc71c71c71c741, 0xbfbc71c71c71c741, 0xbfbc71c71c71c741, 0xbfbc71c71c71c741, 0xbfbc71c71c71c741, 0xbfbc71c71c71c741, 0xbfbc71c71c71c741 + /*== poly_coeff1 ==*/ + .align 64 + .quad 0x3fd5555555555557, 0x3fd5555555555557, 0x3fd5555555555557, 0x3fd5555555555557, 0x3fd5555555555557, 0x3fd5555555555557, 0x3fd5555555555557, 0x3fd5555555555557 + .align 64 + .type __svml_dcbrt_data_internal_avx512,@object + .size 
__svml_dcbrt_data_internal_avx512,.-__svml_dcbrt_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core-avx2.S new file mode 100644 index 0000000000..faa847fba6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized cbrtf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVeN16v_cbrtf _ZGVeN16v_cbrtf_avx2_wrapper +#include "../svml_s_cbrtf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core.c new file mode 100644 index 0000000000..785a68cc0d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized cbrtf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVeN16v_cbrtf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_cbrtf, __GI__ZGVeN16v_cbrtf, + __redirect__ZGVeN16v_cbrtf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core_avx512.S new file mode 100644 index 0000000000..55b017682b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf16_core_avx512.S @@ -0,0 +1,235 @@ +/* Function cbrtf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * x=2^{3*k+j} * 1.b1 b2 ... b5 b6 ... b52 + * Let r=(x*2^{-3k-j} - 1.b1 b2 ... b5 1)* rcp[b1 b2 ..b5], + * where rcp[b1 b2 .. 
b5]=1/(1.b1 b2 b3 b4 b5 1) in single precision + * cbrtf(2^j * 1. b1 b2 .. b5 1) is approximated as T[j][b1..b5]+D[j][b1..b5] + * (T stores the high 24 bits, D stores the low order bits) + * Result=2^k*T+(2^k*T*r)*P+2^k*D + * where P=p1+p2*r+.. + * + */ + +/* Offsets for data table __svml_scbrt_data_internal_avx512 + */ +#define etbl_H 0 +#define etbl_L 64 +#define cbrt_tbl_H 128 +#define BiasL 256 +#define SZero 320 +#define OneThird 384 +#define Bias3 448 +#define Three 512 +#define One 576 +#define poly_coeff3 640 +#define poly_coeff2 704 +#define poly_coeff1 768 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_cbrtf_skx) + vgetmantps $0, {sae}, %zmm0, %zmm8 + +/* GetExp(x) */ + vgetexpps {sae}, %zmm0, %zmm1 + vmovups BiasL+__svml_scbrt_data_internal_avx512(%rip), %zmm2 + +/* exponent/3 */ + vmovups OneThird+__svml_scbrt_data_internal_avx512(%rip), %zmm3 + vmovups Bias3+__svml_scbrt_data_internal_avx512(%rip), %zmm4 + vmovups One+__svml_scbrt_data_internal_avx512(%rip), %zmm15 + +/* exponent%3 (to be used as index) */ + vmovups Three+__svml_scbrt_data_internal_avx512(%rip), %zmm5 + +/* polynomial */ + vmovups poly_coeff3+__svml_scbrt_data_internal_avx512(%rip), %zmm11 + vmovups poly_coeff1+__svml_scbrt_data_internal_avx512(%rip), %zmm14 + +/* Table lookup */ + vmovups cbrt_tbl_H+__svml_scbrt_data_internal_avx512(%rip), %zmm12 + +/* DblRcp ~ 1/Mantissa */ + vrcp14ps %zmm8, %zmm7 + vaddps {rn-sae}, %zmm2, %zmm1, %zmm6 + vandps SZero+__svml_scbrt_data_internal_avx512(%rip), %zmm0, %zmm0 + +/* round DblRcp to 3 fractional bits (RN mode, no Precision exception) */ + vrndscaleps $88, {sae}, %zmm7, %zmm9 + vfmsub231ps {rn-sae}, %zmm6, %zmm3, %zmm4 + vmovups poly_coeff2+__svml_scbrt_data_internal_avx512(%rip), %zmm7 + +/* Reduced argument: R = DblRcp*Mantissa - 1 */ + vfmsub231ps {rn-sae}, %zmm9, %zmm8, %zmm15 + vrndscaleps $9, {sae}, %zmm4, %zmm13 + +/* Prepare table index */ + vpsrld $19, %zmm9, %zmm10 + vfmadd231ps {rn-sae}, %zmm15, 
%zmm11, %zmm7 + vfnmadd231ps {rn-sae}, %zmm13, %zmm5, %zmm6 + vpermt2ps cbrt_tbl_H+64+__svml_scbrt_data_internal_avx512(%rip), %zmm10, %zmm12 + vfmadd213ps {rn-sae}, %zmm14, %zmm15, %zmm7 + vscalefps {rn-sae}, %zmm13, %zmm12, %zmm2 + +/* Table lookup: 2^(exponent%3) */ + vpermps __svml_scbrt_data_internal_avx512(%rip), %zmm6, %zmm1 + vpermps etbl_L+__svml_scbrt_data_internal_avx512(%rip), %zmm6, %zmm6 + +/* Sh*R */ + vmulps {rn-sae}, %zmm15, %zmm1, %zmm14 + +/* Sl + (Sh*R)*Poly */ + vfmadd213ps {rn-sae}, %zmm6, %zmm7, %zmm14 + +/* + * branch-free + * scaled_Th*(Sh+Sl+Sh*R*Poly) + */ + vaddps {rn-sae}, %zmm1, %zmm14, %zmm15 + vmulps {rn-sae}, %zmm2, %zmm15, %zmm3 + vorps %zmm0, %zmm3, %zmm0 + ret + +END(_ZGVeN16v_cbrtf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_scbrt_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 etbl_H[16][1]; + __declspec(align(64)) VUINT32 etbl_L[16][1]; + __declspec(align(64)) VUINT32 cbrt_tbl_H[32][1]; + __declspec(align(64)) VUINT32 BiasL[16][1]; + __declspec(align(64)) VUINT32 SZero[16][1]; + __declspec(align(64)) VUINT32 OneThird[16][1]; + __declspec(align(64)) VUINT32 Bias3[16][1]; + __declspec(align(64)) VUINT32 Three[16][1]; + __declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 poly_coeff3[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 poly_coeff1[16][1]; + } __svml_scbrt_data_internal_avx512; +#endif +__svml_scbrt_data_internal_avx512: + /*== etbl_H ==*/ + .long 0x3f800000 + .long 0x3fa14518 + .long 0x3fcb2ff5 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + /*== etbl_L ==*/ + .align 64 + .long 0x00000000 + .long 0xb2ce51af + .long 0x32a7adc8 + .long 0x00000000 + .long 0x00000000 + .long 
0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + /*== cbrt_tbl_H ==*/ + .align 64 + .long 0x3fa14518 + .long 0x3f9e0b2b + .long 0x3f9b0f9b + .long 0x3f984a9a + .long 0x3f95b5af + .long 0x3f934b6c + .long 0x3f910737 + .long 0x3f8ee526 + .long 0x3f8ce1da + .long 0x3f8afa6a + .long 0x3f892c4e + .long 0x3f87754e + .long 0x3f85d377 + .long 0x3f844510 + .long 0x3f82c892 + .long 0x3f815c9f + .long 0x3f800000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + .long 0x00000000 + /*== BiasL ==*/ + .align 64 + .long 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000, 0x4b400000 + /*== Zero ==*/ + .align 64 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 + /*== OneThird ==*/ + .align 64 + .long 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab, 0x3eaaaaab + /*== Bias3 ==*/ + .align 64 + .long 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000, 0x4a800000 + /*== Three ==*/ + .align 64 + .long 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000, 0x40400000 + 
/*==One ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== poly_coeff3 ==*/ + .align 64 + .long 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c, 0x3d7d057c + /*== poly_coeff2 ==*/ + .align 64 + .long 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363, 0xbde3a363 + /*== poly_coeff1 ==*/ + .align 64 + .long 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa, 0x3eaaaaaa + .align 64 + .type __svml_scbrt_data_internal_avx512,@object + .size __svml_scbrt_data_internal_avx512,.-__svml_scbrt_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core-sse2.S new file mode 100644 index 0000000000..76fc254e7a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized cbrtf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVbN4v_cbrtf _ZGVbN4v_cbrtf_sse2 +#include "../svml_s_cbrtf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core.c new file mode 100644 index 0000000000..564a549b39 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized cbrtf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVbN4v_cbrtf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_cbrtf, __GI__ZGVbN4v_cbrtf, + __redirect__ZGVbN4v_cbrtf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core_sse4.S new file mode 100644 index 0000000000..af42dd5164 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf4_core_sse4.S @@ -0,0 +1,490 @@ +/* Function cbrtf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * x=2^{3*k+j} * 1.b1 b2 ... b5 b6 ... b52 + * Let r=(x*2^{-3k-j} - 1.b1 b2 ... b5 1)* rcp[b1 b2 ..b5], + * where rcp[b1 b2 .. b5]=1/(1.b1 b2 b3 b4 b5 1) in single precision + * cbrtf(2^j * 1. b1 b2 .. b5 1) is approximated as T[j][b1..b5]+D[j][b1..b5] + * (T stores the high 24 bits, D stores the low order bits) + * Result=2^k*T+(2^k*T*r)*P+2^k*D + * where P=p1+p2*r+.. 
+ * + */ + +/* Offsets for data table __svml_scbrt_data_internal + */ +#define _sRcp 0 +#define _sCbrtHL 128 +#define _sP2 512 +#define _sP1 528 +#define _sMantissaMask 544 +#define _sMantissaMask1 560 +#define _sExpMask 576 +#define _sExpMask1 592 +#define _iRcpIndexMask 608 +#define _iBExpMask 624 +#define _iSignMask 640 +#define _iBias 656 +#define _iOne 672 +#define _i555 688 +#define _iAbsMask 704 +#define _iSubConst 720 +#define _iCmpConst 736 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_cbrtf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* + * Load constants + * Reciprocal index calculation + */ + movaps %xmm0, %xmm2 + movdqu _iRcpIndexMask+__svml_scbrt_data_internal(%rip), %xmm3 + psrld $16, %xmm2 + pand %xmm2, %xmm3 + +/* Load reciprocal value */ + lea __svml_scbrt_data_internal(%rip), %rdx + pshufd $1, %xmm3, %xmm5 + +/* Get signed biased exponent */ + psrld $7, %xmm2 + movd %xmm3, %eax + movd %xmm5, %ecx + +/* Get absolute biased exponent */ + movdqu _iBExpMask+__svml_scbrt_data_internal(%rip), %xmm15 + +/* + * Calculate exponent/3 + * i555Exp=(2^{12}-1)/3*exponent + */ + movdqu _i555+__svml_scbrt_data_internal(%rip), %xmm14 + pand %xmm2, %xmm15 + movslq %eax, %rax + movdqa %xmm14, %xmm5 + movslq %ecx, %rcx + psrlq $32, %xmm14 + pmuludq %xmm15, %xmm5 + movd (%rdx,%rax), %xmm4 + movd (%rdx,%rcx), %xmm6 + punpckldq %xmm6, %xmm4 + movdqa %xmm15, %xmm6 + psrlq $32, %xmm15 + pmuludq %xmm14, %xmm15 + pshufd $2, %xmm3, %xmm7 + psllq $32, %xmm15 + pshufd $3, %xmm3, %xmm8 + movd %xmm7, %esi + movd %xmm8, %edi + +/* Argument reduction */ + movups _sMantissaMask+__svml_scbrt_data_internal(%rip), %xmm12 + movups _sMantissaMask1+__svml_scbrt_data_internal(%rip), %xmm11 + andps %xmm0, %xmm12 + pand .FLT_17(%rip), %xmm5 + andps %xmm0, %xmm11 + movslq %esi, %rsi + por %xmm15, %xmm5 + movslq %edi, %rdi + +/* Get K (exponent=3*k+j) */ + psrld $12, %xmm5 + orps _sExpMask+__svml_scbrt_data_internal(%rip), %xmm12 + orps 
_sExpMask1+__svml_scbrt_data_internal(%rip), %xmm11 + psubd _iOne+__svml_scbrt_data_internal(%rip), %xmm6 + +/* r=y-y` */ + subps %xmm11, %xmm12 + +/* Get J */ + psubd %xmm5, %xmm6 + movdqu _iAbsMask+__svml_scbrt_data_internal(%rip), %xmm1 + psubd %xmm5, %xmm6 + movd (%rdx,%rsi), %xmm10 + pand %xmm0, %xmm1 + movd (%rdx,%rdi), %xmm9 + psubd %xmm5, %xmm6 + punpckldq %xmm9, %xmm10 + +/* Get 128*J */ + pslld $7, %xmm6 + punpcklqdq %xmm10, %xmm4 + +/* + * iCbrtIndex=4*l+128*j + * Zero index if callout expected + */ + paddd %xmm6, %xmm3 + psubd _iSubConst+__svml_scbrt_data_internal(%rip), %xmm1 + pcmpgtd _iCmpConst+__svml_scbrt_data_internal(%rip), %xmm1 + +/* r=(y-y`)*rcp_table(y`) */ + mulps %xmm12, %xmm4 + movmskps %xmm1, %eax + +/* Biased exponent-1 */ + movdqu _iSignMask+__svml_scbrt_data_internal(%rip), %xmm13 + pandn %xmm3, %xmm1 + +/* + * Add 2/3*(bias-1)+1 to (k+1/3*(bias-1)) + * Attach sign to exponent + */ + movdqu _iBias+__svml_scbrt_data_internal(%rip), %xmm12 + pand %xmm13, %xmm2 + paddd %xmm5, %xmm12 + +/* Load Cbrt table Hi & Lo values */ + movd %xmm1, %r8d + por %xmm2, %xmm12 + pshufd $1, %xmm1, %xmm2 + pslld $23, %xmm12 + pshufd $2, %xmm1, %xmm7 + pshufd $3, %xmm1, %xmm1 + movd %xmm2, %r9d + movd %xmm7, %r10d + movd %xmm1, %r11d + +/* Polynomial: p1+r*(p2*r+r*(p3+r*p4)) */ + movups _sP2+__svml_scbrt_data_internal(%rip), %xmm11 + mulps %xmm4, %xmm11 + movslq %r8d, %r8 + addps _sP1+__svml_scbrt_data_internal(%rip), %xmm11 + movslq %r9d, %r9 + movslq %r10d, %r10 + movslq %r11d, %r11 + movd 128(%rdx,%r8), %xmm10 + movd 128(%rdx,%r9), %xmm3 + movd 128(%rdx,%r10), %xmm9 + movd 128(%rdx,%r11), %xmm8 + punpckldq %xmm3, %xmm10 + punpckldq %xmm8, %xmm9 + punpcklqdq %xmm9, %xmm10 + +/* sCbrtHi *= 2^k */ + mulps %xmm10, %xmm12 + +/* T`*r */ + mulps %xmm12, %xmm4 + +/* (T`*r)*P */ + mulps %xmm4, %xmm11 + +/* + * T`*r*P+D` + * result = T`+(T`*r*P+D`) + */ + addps %xmm11, %xmm12 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne 
L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax xmm0 xmm12 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm12, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm12, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax + + xorl %edx, %edx + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %edx, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %eax, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm12 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm12 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call cbrtf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_cbrtf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_scbrt_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _sRcp[32][1]; + __declspec(align(16)) VUINT32 _sCbrtHL[96][1]; + __declspec(align(16)) VUINT32 _sP2[4][1]; + __declspec(align(16)) VUINT32 _sP1[4][1]; + __declspec(align(16)) VUINT32 _sMantissaMask[4][1]; + 
__declspec(align(16)) VUINT32 _sMantissaMask1[4][1]; + __declspec(align(16)) VUINT32 _sExpMask[4][1]; + __declspec(align(16)) VUINT32 _sExpMask1[4][1]; + __declspec(align(16)) VUINT32 _iRcpIndexMask[4][1]; + __declspec(align(16)) VUINT32 _iBExpMask[4][1]; + __declspec(align(16)) VUINT32 _iSignMask[4][1]; + __declspec(align(16)) VUINT32 _iBias[4][1]; + __declspec(align(16)) VUINT32 _iOne[4][1]; + __declspec(align(16)) VUINT32 _i555[4][1]; + __declspec(align(16)) VUINT32 _iAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iSubConst[4][1]; + __declspec(align(16)) VUINT32 _iCmpConst[4][1]; +} __svml_scbrt_data_internal; +#endif +__svml_scbrt_data_internal: + /*== _sRcp ==*/ + .long 0xBF7C0FC1 /* (1/(1+0/32+1/64)) = -.984615 */ + .long 0xBF74898D /* (1/(1+1/32+1/64)) = -.955224 */ + .long 0xBF6D7304 /* (1/(1+2/32+1/64)) = -.927536 */ + .long 0xBF66C2B4 /* (1/(1+3/32+1/64)) = -.901408 */ + .long 0xBF607038 /* (1/(1+4/32+1/64)) = -.876712 */ + .long 0xBF5A740E /* (1/(1+5/32+1/64)) = -.853333 */ + .long 0xBF54C77B /* (1/(1+6/32+1/64)) = -.831169 */ + .long 0xBF4F6475 /* (1/(1+7/32+1/64)) = -.810127 */ + .long 0xBF4A4588 /* (1/(1+8/32+1/64)) = -.790123 */ + .long 0xBF4565C8 /* (1/(1+9/32+1/64)) = -.771084 */ + .long 0xBF40C0C1 /* (1/(1+10/32+1/64)) = -.752941 */ + .long 0xBF3C5264 /* (1/(1+11/32+1/64)) = -.735632 */ + .long 0xBF381703 /* (1/(1+12/32+1/64)) = -.719101 */ + .long 0xBF340B41 /* (1/(1+13/32+1/64)) = -.703297 */ + .long 0xBF302C0B /* (1/(1+14/32+1/64)) = -.688172 */ + .long 0xBF2C7692 /* (1/(1+15/32+1/64)) = -.673684 */ + .long 0xBF28E83F /* (1/(1+16/32+1/64)) = -.659794 */ + .long 0xBF257EB5 /* (1/(1+17/32+1/64)) = -.646465 */ + .long 0xBF2237C3 /* (1/(1+18/32+1/64)) = -.633663 */ + .long 0xBF1F1166 /* (1/(1+19/32+1/64)) = -.621359 */ + .long 0xBF1C09C1 /* (1/(1+20/32+1/64)) = -.609524 */ + .long 0xBF191F1A /* (1/(1+21/32+1/64)) = -.598131 */ + .long 0xBF164FDA /* (1/(1+22/32+1/64)) = -.587156 */ + .long 0xBF139A86 /* (1/(1+23/32+1/64)) = -.576577 */ + .long 
0xBF10FDBC /* (1/(1+24/32+1/64)) = -.566372 */ + .long 0xBF0E7835 /* (1/(1+25/32+1/64)) = -.556522 */ + .long 0xBF0C08C1 /* (1/(1+26/32+1/64)) = -.547009 */ + .long 0xBF09AE41 /* (1/(1+27/32+1/64)) = -.537815 */ + .long 0xBF0767AB /* (1/(1+28/32+1/64)) = -.528926 */ + .long 0xBF053408 /* (1/(1+29/32+1/64)) = -.520325 */ + .long 0xBF03126F /* (1/(1+30/32+1/64)) = -.512 */ + .long 0xBF010204 /* (1/(1+31/32+1/64)) = -.503937 */ + /*== _sCbrtHL ==*/ + .align 16 + .long 0x3F80A9C9 /* HI((2^0*(1+0/32+1/64))^(1/3)) = 1.005181 */ + .long 0x3F81F833 /* HI((2^0*(1+1/32+1/64))^(1/3)) = 1.015387 */ + .long 0x3F834007 /* HI((2^0*(1+2/32+1/64))^(1/3)) = 1.025391 */ + .long 0x3F848194 /* HI((2^0*(1+3/32+1/64))^(1/3)) = 1.035204 */ + .long 0x3F85BD25 /* HI((2^0*(1+4/32+1/64))^(1/3)) = 1.044835 */ + .long 0x3F86F300 /* HI((2^0*(1+5/32+1/64))^(1/3)) = 1.054291 */ + .long 0x3F882365 /* HI((2^0*(1+6/32+1/64))^(1/3)) = 1.06358 */ + .long 0x3F894E90 /* HI((2^0*(1+7/32+1/64))^(1/3)) = 1.07271 */ + .long 0x3F8A74B9 /* HI((2^0*(1+8/32+1/64))^(1/3)) = 1.081687 */ + .long 0x3F8B9615 /* HI((2^0*(1+9/32+1/64))^(1/3)) = 1.090518 */ + .long 0x3F8CB2D4 /* HI((2^0*(1+10/32+1/64))^(1/3)) = 1.099207 */ + .long 0x3F8DCB24 /* HI((2^0*(1+11/32+1/64))^(1/3)) = 1.107762 */ + .long 0x3F8EDF31 /* HI((2^0*(1+12/32+1/64))^(1/3)) = 1.116186 */ + .long 0x3F8FEF22 /* HI((2^0*(1+13/32+1/64))^(1/3)) = 1.124485 */ + .long 0x3F90FB1F /* HI((2^0*(1+14/32+1/64))^(1/3)) = 1.132664 */ + .long 0x3F92034C /* HI((2^0*(1+15/32+1/64))^(1/3)) = 1.140726 */ + .long 0x3F9307CA /* HI((2^0*(1+16/32+1/64))^(1/3)) = 1.148675 */ + .long 0x3F9408B9 /* HI((2^0*(1+17/32+1/64))^(1/3)) = 1.156516 */ + .long 0x3F950638 /* HI((2^0*(1+18/32+1/64))^(1/3)) = 1.164252 */ + .long 0x3F960064 /* HI((2^0*(1+19/32+1/64))^(1/3)) = 1.171887 */ + .long 0x3F96F759 /* HI((2^0*(1+20/32+1/64))^(1/3)) = 1.179423 */ + .long 0x3F97EB2F /* HI((2^0*(1+21/32+1/64))^(1/3)) = 1.186865 */ + .long 0x3F98DC01 /* HI((2^0*(1+22/32+1/64))^(1/3)) = 1.194214 */ + .long 
0x3F99C9E5 /* HI((2^0*(1+23/32+1/64))^(1/3)) = 1.201474 */ + .long 0x3F9AB4F2 /* HI((2^0*(1+24/32+1/64))^(1/3)) = 1.208647 */ + .long 0x3F9B9D3D /* HI((2^0*(1+25/32+1/64))^(1/3)) = 1.215736 */ + .long 0x3F9C82DA /* HI((2^0*(1+26/32+1/64))^(1/3)) = 1.222743 */ + .long 0x3F9D65DD /* HI((2^0*(1+27/32+1/64))^(1/3)) = 1.229671 */ + .long 0x3F9E4659 /* HI((2^0*(1+28/32+1/64))^(1/3)) = 1.236522 */ + .long 0x3F9F245F /* HI((2^0*(1+29/32+1/64))^(1/3)) = 1.243297 */ + .long 0x3FA00000 /* HI((2^0*(1+30/32+1/64))^(1/3)) = 1.25 */ + .long 0x3FA0D94C /* HI((2^0*(1+31/32+1/64))^(1/3)) = 1.256631 */ + .long 0x3FA21B02 /* HI((2^1*(1+0/32+1/64))^(1/3)) = 1.266449 */ + .long 0x3FA3C059 /* HI((2^1*(1+1/32+1/64))^(1/3)) = 1.279307 */ + .long 0x3FA55D61 /* HI((2^1*(1+2/32+1/64))^(1/3)) = 1.291912 */ + .long 0x3FA6F282 /* HI((2^1*(1+3/32+1/64))^(1/3)) = 1.304276 */ + .long 0x3FA8801A /* HI((2^1*(1+4/32+1/64))^(1/3)) = 1.316409 */ + .long 0x3FAA067E /* HI((2^1*(1+5/32+1/64))^(1/3)) = 1.328323 */ + .long 0x3FAB8602 /* HI((2^1*(1+6/32+1/64))^(1/3)) = 1.340027 */ + .long 0x3FACFEEF /* HI((2^1*(1+7/32+1/64))^(1/3)) = 1.35153 */ + .long 0x3FAE718E /* HI((2^1*(1+8/32+1/64))^(1/3)) = 1.36284 */ + .long 0x3FAFDE1F /* HI((2^1*(1+9/32+1/64))^(1/3)) = 1.373966 */ + .long 0x3FB144E1 /* HI((2^1*(1+10/32+1/64))^(1/3)) = 1.384915 */ + .long 0x3FB2A60D /* HI((2^1*(1+11/32+1/64))^(1/3)) = 1.395692 */ + .long 0x3FB401DA /* HI((2^1*(1+12/32+1/64))^(1/3)) = 1.406307 */ + .long 0x3FB5587B /* HI((2^1*(1+13/32+1/64))^(1/3)) = 1.416763 */ + .long 0x3FB6AA20 /* HI((2^1*(1+14/32+1/64))^(1/3)) = 1.427067 */ + .long 0x3FB7F6F7 /* HI((2^1*(1+15/32+1/64))^(1/3)) = 1.437224 */ + .long 0x3FB93F29 /* HI((2^1*(1+16/32+1/64))^(1/3)) = 1.44724 */ + .long 0x3FBA82E1 /* HI((2^1*(1+17/32+1/64))^(1/3)) = 1.457119 */ + .long 0x3FBBC244 /* HI((2^1*(1+18/32+1/64))^(1/3)) = 1.466866 */ + .long 0x3FBCFD77 /* HI((2^1*(1+19/32+1/64))^(1/3)) = 1.476485 */ + .long 0x3FBE349B /* HI((2^1*(1+20/32+1/64))^(1/3)) = 1.48598 */ + .long 
0x3FBF67D3 /* HI((2^1*(1+21/32+1/64))^(1/3)) = 1.495356 */ + .long 0x3FC0973C /* HI((2^1*(1+22/32+1/64))^(1/3)) = 1.504615 */ + .long 0x3FC1C2F6 /* HI((2^1*(1+23/32+1/64))^(1/3)) = 1.513762 */ + .long 0x3FC2EB1A /* HI((2^1*(1+24/32+1/64))^(1/3)) = 1.5228 */ + .long 0x3FC40FC6 /* HI((2^1*(1+25/32+1/64))^(1/3)) = 1.531731 */ + .long 0x3FC53112 /* HI((2^1*(1+26/32+1/64))^(1/3)) = 1.54056 */ + .long 0x3FC64F16 /* HI((2^1*(1+27/32+1/64))^(1/3)) = 1.549289 */ + .long 0x3FC769EB /* HI((2^1*(1+28/32+1/64))^(1/3)) = 1.55792 */ + .long 0x3FC881A6 /* HI((2^1*(1+29/32+1/64))^(1/3)) = 1.566457 */ + .long 0x3FC9965D /* HI((2^1*(1+30/32+1/64))^(1/3)) = 1.574901 */ + .long 0x3FCAA825 /* HI((2^1*(1+31/32+1/64))^(1/3)) = 1.583256 */ + .long 0x3FCC3D79 /* HI((2^2*(1+0/32+1/64))^(1/3)) = 1.595626 */ + .long 0x3FCE5054 /* HI((2^2*(1+1/32+1/64))^(1/3)) = 1.611826 */ + .long 0x3FD058B8 /* HI((2^2*(1+2/32+1/64))^(1/3)) = 1.627707 */ + .long 0x3FD25726 /* HI((2^2*(1+3/32+1/64))^(1/3)) = 1.643285 */ + .long 0x3FD44C15 /* HI((2^2*(1+4/32+1/64))^(1/3)) = 1.658572 */ + .long 0x3FD637F2 /* HI((2^2*(1+5/32+1/64))^(1/3)) = 1.673582 */ + .long 0x3FD81B24 /* HI((2^2*(1+6/32+1/64))^(1/3)) = 1.688328 */ + .long 0x3FD9F60B /* HI((2^2*(1+7/32+1/64))^(1/3)) = 1.702821 */ + .long 0x3FDBC8FE /* HI((2^2*(1+8/32+1/64))^(1/3)) = 1.717071 */ + .long 0x3FDD9452 /* HI((2^2*(1+9/32+1/64))^(1/3)) = 1.731089 */ + .long 0x3FDF5853 /* HI((2^2*(1+10/32+1/64))^(1/3)) = 1.744883 */ + .long 0x3FE1154B /* HI((2^2*(1+11/32+1/64))^(1/3)) = 1.758462 */ + .long 0x3FE2CB7F /* HI((2^2*(1+12/32+1/64))^(1/3)) = 1.771835 */ + .long 0x3FE47B2E /* HI((2^2*(1+13/32+1/64))^(1/3)) = 1.785009 */ + .long 0x3FE62496 /* HI((2^2*(1+14/32+1/64))^(1/3)) = 1.797992 */ + .long 0x3FE7C7F0 /* HI((2^2*(1+15/32+1/64))^(1/3)) = 1.810789 */ + .long 0x3FE96571 /* HI((2^2*(1+16/32+1/64))^(1/3)) = 1.823408 */ + .long 0x3FEAFD4C /* HI((2^2*(1+17/32+1/64))^(1/3)) = 1.835855 */ + .long 0x3FEC8FB3 /* HI((2^2*(1+18/32+1/64))^(1/3)) = 1.848135 */ + .long 
0x3FEE1CD3 /* HI((2^2*(1+19/32+1/64))^(1/3)) = 1.860255 */ + .long 0x3FEFA4D7 /* HI((2^2*(1+20/32+1/64))^(1/3)) = 1.872218 */ + .long 0x3FF127E9 /* HI((2^2*(1+21/32+1/64))^(1/3)) = 1.88403 */ + .long 0x3FF2A62F /* HI((2^2*(1+22/32+1/64))^(1/3)) = 1.895697 */ + .long 0x3FF41FD0 /* HI((2^2*(1+23/32+1/64))^(1/3)) = 1.907221 */ + .long 0x3FF594EE /* HI((2^2*(1+24/32+1/64))^(1/3)) = 1.918607 */ + .long 0x3FF705AC /* HI((2^2*(1+25/32+1/64))^(1/3)) = 1.929861 */ + .long 0x3FF8722A /* HI((2^2*(1+26/32+1/64))^(1/3)) = 1.940984 */ + .long 0x3FF9DA86 /* HI((2^2*(1+27/32+1/64))^(1/3)) = 1.951981 */ + .long 0x3FFB3EDE /* HI((2^2*(1+28/32+1/64))^(1/3)) = 1.962856 */ + .long 0x3FFC9F4E /* HI((2^2*(1+29/32+1/64))^(1/3)) = 1.973612 */ + .long 0x3FFDFBF2 /* HI((2^2*(1+30/32+1/64))^(1/3)) = 1.984251 */ + .long 0x3FFF54E3 /* HI((2^2*(1+31/32+1/64))^(1/3)) = 1.994778 */ + .align 16 + .long 0xBDE3A962, 0xBDE3A962, 0xBDE3A962, 0xBDE3A962 /* _sP2 */ + .align 16 + .long 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91 /* _sP1 */ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff /* _sMantissaMask (EXP_MSK3) */ + .align 16 + .long 0x007e0000, 0x007e0000, 0x007e0000, 0x007e0000 /* _sMantissaMask1 (SIG_MASK) */ + .align 16 + .long 0xBF800000, 0xBF800000, 0xBF800000, 0xBF800000 /* _sExpMask (EXP_MASK) */ + .align 16 + .long 0xBF820000, 0xBF820000, 0xBF820000, 0xBF820000 /* _sExpMask1 (EXP_MASK2) */ + .align 16 + .long 0x0000007c, 0x0000007c, 0x0000007c, 0x0000007c /* _iRcpIndexMask */ + .align 16 + .long 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff /* _iBExpMask */ + .align 16 + .long 0x00000100, 0x00000100, 0x00000100, 0x00000100 /* _iSignMask */ + .align 16 + .long 0x00000055, 0x00000055, 0x00000055, 0x00000055 /* _iBias */ + .align 16 + .long 0x00000001, 0x00000001, 0x00000001, 0x00000001 /* _iOne */ + .align 16 + .long 0x00000555, 0x00000555, 0x00000555, 0x00000555 /* _i555 */ + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 16 + 
.long 0x80800000, 0x80800000, 0x80800000, 0x80800000 /* _iSubConst */ + .align 16 + .long 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF /* _iCmpConst */ + .align 16 + .type __svml_scbrt_data_internal,@object + .size __svml_scbrt_data_internal,.-__svml_scbrt_data_internal + .align 16 + +.FLT_17: + .long 0xffffffff,0x00000000,0xffffffff,0x00000000 + .type .FLT_17,@object + .size .FLT_17,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core-sse.S new file mode 100644 index 0000000000..8eaa457fa6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized cbrtf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN8v_cbrtf _ZGVdN8v_cbrtf_sse_wrapper +#include "../svml_s_cbrtf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core.c new file mode 100644 index 0000000000..089d28461f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized cbrtf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN8v_cbrtf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_cbrtf, __GI__ZGVdN8v_cbrtf, + __redirect__ZGVdN8v_cbrtf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core_avx2.S new file mode 100644 index 0000000000..acd20d9db8 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_cbrtf8_core_avx2.S @@ -0,0 +1,509 @@ +/* Function cbrtf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +/* + * ALGORITHM DESCRIPTION: + * + * x=2^{3*k+j} * 1.b1 b2 ... b5 b6 ... b52 + * Let r=(x*2^{-3k-j} - 1.b1 b2 ... b5 1)* rcp[b1 b2 ..b5], + * where rcp[b1 b2 .. b5]=1/(1.b1 b2 b3 b4 b5 1) in single precision + * cbrtf(2^j * 1. b1 b2 .. b5 1) is approximated as T[j][b1..b5]+D[j][b1..b5] + * (T stores the high 24 bits, D stores the low order bits) + * Result=2^k*T+(2^k*T*r)*P+2^k*D + * where P=p1+p2*r+.. + * + */ + +/* Offsets for data table __svml_scbrt_data_internal + */ +#define _sRcp 0 +#define _sCbrtHL 128 +#define _sP2 512 +#define _sP1 544 +#define _sMantissaMask 576 +#define _sMantissaMask1 608 +#define _sExpMask 640 +#define _sExpMask1 672 +#define _iRcpIndexMask 704 +#define _iBExpMask 736 +#define _iSignMask 768 +#define _iBias 800 +#define _iOne 832 +#define _i555 864 +#define _iAbsMask 896 +#define _iSubConst 928 +#define _iCmpConst 960 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_cbrtf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* Load reciprocal value */ + lea __svml_scbrt_data_internal(%rip), %rdx + vmovaps %ymm0, %ymm5 + +/* + * Load constants + * Reciprocal index calculation + */ + vpsrld $16, %ymm5, %ymm3 + vpand _iRcpIndexMask+__svml_scbrt_data_internal(%rip), %ymm3, %ymm4 + vextractf128 $1, %ymm4, %xmm15 + vmovd %xmm4, %eax + vmovd %xmm15, %r8d + vpextrd $1, %xmm15, %r9d + vpextrd $2, %xmm15, %r10d + vpextrd $3, %xmm15, %r11d + movslq %r8d, %r8 + movslq %r9d, %r9 + movslq %r10d, %r10 + movslq %r11d, %r11 + vpextrd $1, %xmm4, %ecx + vpextrd $2, %xmm4, %esi + vpextrd $3, %xmm4, %edi + movslq %eax, %rax + movslq %ecx, %rcx + movslq %esi, %rsi + movslq %edi, %rdi + vmovd (%rdx,%r8), %xmm13 + vmovd (%rdx,%r9), %xmm14 + vmovd (%rdx,%r10), %xmm1 + vmovd (%rdx,%r11), 
%xmm0 + vpunpckldq %xmm14, %xmm13, %xmm2 + vpunpckldq %xmm0, %xmm1, %xmm13 + +/* Get signed biased exponent */ + vpsrld $7, %ymm3, %ymm0 + vmovd (%rdx,%rax), %xmm6 + vmovd (%rdx,%rcx), %xmm7 + vmovd (%rdx,%rsi), %xmm8 + vmovd (%rdx,%rdi), %xmm9 + vpunpckldq %xmm7, %xmm6, %xmm10 + vpunpckldq %xmm9, %xmm8, %xmm11 + vpunpcklqdq %xmm11, %xmm10, %xmm12 + vpunpcklqdq %xmm13, %xmm2, %xmm6 + vandps _iAbsMask+__svml_scbrt_data_internal(%rip), %ymm5, %ymm3 + +/* Argument reduction */ + vandps _sMantissaMask+__svml_scbrt_data_internal(%rip), %ymm5, %ymm8 + vandps _sMantissaMask1+__svml_scbrt_data_internal(%rip), %ymm5, %ymm9 + vpsubd _iSubConst+__svml_scbrt_data_internal(%rip), %ymm3, %ymm7 + vorps _sExpMask+__svml_scbrt_data_internal(%rip), %ymm8, %ymm10 + vorps _sExpMask1+__svml_scbrt_data_internal(%rip), %ymm9, %ymm11 + +/* r=y-y` */ + vsubps %ymm11, %ymm10, %ymm15 + +/* Biased exponent-1 */ + vpand _iSignMask+__svml_scbrt_data_internal(%rip), %ymm0, %ymm8 + vpcmpgtd _iCmpConst+__svml_scbrt_data_internal(%rip), %ymm7, %ymm2 + vmovmskps %ymm2, %eax + vinsertf128 $1, %xmm6, %ymm12, %ymm14 + +/* Get absolute biased exponent */ + vpand _iBExpMask+__svml_scbrt_data_internal(%rip), %ymm0, %ymm6 + +/* r=(y-y`)*rcp_table(y`) */ + vmulps %ymm15, %ymm14, %ymm1 + vpsubd _iOne+__svml_scbrt_data_internal(%rip), %ymm6, %ymm10 + +/* + * Calculate exponent/3 + * i555Exp=(2^{12}-1)/3*exponent + */ + vpmulld _i555+__svml_scbrt_data_internal(%rip), %ymm6, %ymm3 + +/* Get K (exponent=3*k+j) */ + vpsrld $12, %ymm3, %ymm13 + +/* Get J */ + vpsubd %ymm13, %ymm10, %ymm11 + +/* Add 2/3*(bias-1)+1 to (k+1/3*(bias-1)) */ + vpaddd _iBias+__svml_scbrt_data_internal(%rip), %ymm13, %ymm7 + vpsubd %ymm13, %ymm11, %ymm12 + +/* Attach sign to exponent */ + vpor %ymm8, %ymm7, %ymm9 + vpsubd %ymm13, %ymm12, %ymm14 + vpslld $23, %ymm9, %ymm0 + +/* Get 128*J */ + vpslld $7, %ymm14, %ymm15 + +/* iCbrtIndex=4*l+128*j */ + vpaddd %ymm15, %ymm4, %ymm4 + +/* Zero index if callout expected */ + vpandn %ymm4, %ymm2, 
%ymm4 + +/* Load Cbrt table Hi & Lo values */ + vmovd %xmm4, %ecx + vextractf128 $1, %ymm4, %xmm13 + vpextrd $1, %xmm4, %esi + movslq %ecx, %rcx + movslq %esi, %rsi + vmovd %xmm13, %r9d + vmovd 128(%rdx,%rcx), %xmm2 + vpextrd $2, %xmm4, %edi + vpextrd $3, %xmm4, %r8d + vmovd 128(%rdx,%rsi), %xmm3 + vpextrd $1, %xmm13, %r10d + vpextrd $2, %xmm13, %ecx + vpextrd $3, %xmm13, %esi + movslq %edi, %rdi + movslq %r8d, %r8 + movslq %r9d, %r9 + movslq %r10d, %r10 + movslq %ecx, %rcx + movslq %esi, %rsi + vmovd 128(%rdx,%rdi), %xmm6 + vmovd 128(%rdx,%r8), %xmm7 + vmovd 128(%rdx,%r9), %xmm11 + vmovd 128(%rdx,%r10), %xmm12 + vmovd 128(%rdx,%rcx), %xmm14 + vmovd 128(%rdx,%rsi), %xmm15 + vpunpckldq %xmm3, %xmm2, %xmm8 + vpunpckldq %xmm7, %xmm6, %xmm9 + vpunpckldq %xmm12, %xmm11, %xmm4 + vpunpckldq %xmm15, %xmm14, %xmm11 + vpunpcklqdq %xmm9, %xmm8, %xmm10 + vpunpcklqdq %xmm11, %xmm4, %xmm2 + vinsertf128 $1, %xmm2, %ymm10, %ymm3 + +/* sCbrtHi *= 2^k */ + vmulps %ymm3, %ymm0, %ymm2 + +/* Polynomial: p1+r*(p2*r+r*(p3+r*p4)) */ + vmovups _sP2+__svml_scbrt_data_internal(%rip), %ymm0 + vfmadd213ps _sP1+__svml_scbrt_data_internal(%rip), %ymm1, %ymm0 + +/* T`*r */ + vmulps %ymm2, %ymm1, %ymm1 + +/* (T`*r)*P */ + vmulps %ymm1, %ymm0, %ymm0 + +/* + * T`*r*P+D` + * result = T`+(T`*r*P+D`) + */ + vaddps %ymm0, %ymm2, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm5 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm5, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; 
DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE 
rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call cbrtf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_cbrtf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_scbrt_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _sRcp[32][1]; + __declspec(align(32)) VUINT32 _sCbrtHL[96][1]; + __declspec(align(32)) VUINT32 _sP2[8][1]; + __declspec(align(32)) VUINT32 _sP1[8][1]; + __declspec(align(32)) VUINT32 _sMantissaMask[8][1]; + __declspec(align(32)) VUINT32 _sMantissaMask1[8][1]; + __declspec(align(32)) VUINT32 _sExpMask[8][1]; + __declspec(align(32)) VUINT32 _sExpMask1[8][1]; + __declspec(align(32)) VUINT32 _iRcpIndexMask[8][1]; + __declspec(align(32)) VUINT32 _iBExpMask[8][1]; + __declspec(align(32)) VUINT32 _iSignMask[8][1]; + __declspec(align(32)) VUINT32 _iBias[8][1]; + __declspec(align(32)) VUINT32 _iOne[8][1]; + __declspec(align(32)) VUINT32 _i555[8][1]; + __declspec(align(32)) VUINT32 _iAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iSubConst[8][1]; + __declspec(align(32)) VUINT32 _iCmpConst[8][1]; +} __svml_scbrt_data_internal; +#endif +__svml_scbrt_data_internal: + /*== _sRcp ==*/ + .long 0xBF7C0FC1 /* (1/(1+0/32+1/64)) = -.984615 */ + .long 0xBF74898D /* (1/(1+1/32+1/64)) = -.955224 */ + .long 0xBF6D7304 /* (1/(1+2/32+1/64)) = -.927536 */ + .long 0xBF66C2B4 /* (1/(1+3/32+1/64)) = -.901408 */ + .long 0xBF607038 /* (1/(1+4/32+1/64)) = -.876712 */ + .long 0xBF5A740E /* (1/(1+5/32+1/64)) = -.853333 */ + .long 0xBF54C77B /* (1/(1+6/32+1/64)) = -.831169 */ + .long 0xBF4F6475 /* (1/(1+7/32+1/64)) = -.810127 */ + .long 0xBF4A4588 /* (1/(1+8/32+1/64)) = -.790123 */ + .long 0xBF4565C8 /* (1/(1+9/32+1/64)) = -.771084 */ + .long 0xBF40C0C1 /* 
(1/(1+10/32+1/64)) = -.752941 */ + .long 0xBF3C5264 /* (1/(1+11/32+1/64)) = -.735632 */ + .long 0xBF381703 /* (1/(1+12/32+1/64)) = -.719101 */ + .long 0xBF340B41 /* (1/(1+13/32+1/64)) = -.703297 */ + .long 0xBF302C0B /* (1/(1+14/32+1/64)) = -.688172 */ + .long 0xBF2C7692 /* (1/(1+15/32+1/64)) = -.673684 */ + .long 0xBF28E83F /* (1/(1+16/32+1/64)) = -.659794 */ + .long 0xBF257EB5 /* (1/(1+17/32+1/64)) = -.646465 */ + .long 0xBF2237C3 /* (1/(1+18/32+1/64)) = -.633663 */ + .long 0xBF1F1166 /* (1/(1+19/32+1/64)) = -.621359 */ + .long 0xBF1C09C1 /* (1/(1+20/32+1/64)) = -.609524 */ + .long 0xBF191F1A /* (1/(1+21/32+1/64)) = -.598131 */ + .long 0xBF164FDA /* (1/(1+22/32+1/64)) = -.587156 */ + .long 0xBF139A86 /* (1/(1+23/32+1/64)) = -.576577 */ + .long 0xBF10FDBC /* (1/(1+24/32+1/64)) = -.566372 */ + .long 0xBF0E7835 /* (1/(1+25/32+1/64)) = -.556522 */ + .long 0xBF0C08C1 /* (1/(1+26/32+1/64)) = -.547009 */ + .long 0xBF09AE41 /* (1/(1+27/32+1/64)) = -.537815 */ + .long 0xBF0767AB /* (1/(1+28/32+1/64)) = -.528926 */ + .long 0xBF053408 /* (1/(1+29/32+1/64)) = -.520325 */ + .long 0xBF03126F /* (1/(1+30/32+1/64)) = -.512 */ + .long 0xBF010204 /* (1/(1+31/32+1/64)) = -.503937 */ + /*== _sCbrtHL ==*/ + .align 32 + .long 0x3F80A9C9 /* HI((2^0*(1+0/32+1/64))^(1/3)) = 1.005181 */ + .long 0x3F81F833 /* HI((2^0*(1+1/32+1/64))^(1/3)) = 1.015387 */ + .long 0x3F834007 /* HI((2^0*(1+2/32+1/64))^(1/3)) = 1.025391 */ + .long 0x3F848194 /* HI((2^0*(1+3/32+1/64))^(1/3)) = 1.035204 */ + .long 0x3F85BD25 /* HI((2^0*(1+4/32+1/64))^(1/3)) = 1.044835 */ + .long 0x3F86F300 /* HI((2^0*(1+5/32+1/64))^(1/3)) = 1.054291 */ + .long 0x3F882365 /* HI((2^0*(1+6/32+1/64))^(1/3)) = 1.06358 */ + .long 0x3F894E90 /* HI((2^0*(1+7/32+1/64))^(1/3)) = 1.07271 */ + .long 0x3F8A74B9 /* HI((2^0*(1+8/32+1/64))^(1/3)) = 1.081687 */ + .long 0x3F8B9615 /* HI((2^0*(1+9/32+1/64))^(1/3)) = 1.090518 */ + .long 0x3F8CB2D4 /* HI((2^0*(1+10/32+1/64))^(1/3)) = 1.099207 */ + .long 0x3F8DCB24 /* HI((2^0*(1+11/32+1/64))^(1/3)) = 
1.107762 */ + .long 0x3F8EDF31 /* HI((2^0*(1+12/32+1/64))^(1/3)) = 1.116186 */ + .long 0x3F8FEF22 /* HI((2^0*(1+13/32+1/64))^(1/3)) = 1.124485 */ + .long 0x3F90FB1F /* HI((2^0*(1+14/32+1/64))^(1/3)) = 1.132664 */ + .long 0x3F92034C /* HI((2^0*(1+15/32+1/64))^(1/3)) = 1.140726 */ + .long 0x3F9307CA /* HI((2^0*(1+16/32+1/64))^(1/3)) = 1.148675 */ + .long 0x3F9408B9 /* HI((2^0*(1+17/32+1/64))^(1/3)) = 1.156516 */ + .long 0x3F950638 /* HI((2^0*(1+18/32+1/64))^(1/3)) = 1.164252 */ + .long 0x3F960064 /* HI((2^0*(1+19/32+1/64))^(1/3)) = 1.171887 */ + .long 0x3F96F759 /* HI((2^0*(1+20/32+1/64))^(1/3)) = 1.179423 */ + .long 0x3F97EB2F /* HI((2^0*(1+21/32+1/64))^(1/3)) = 1.186865 */ + .long 0x3F98DC01 /* HI((2^0*(1+22/32+1/64))^(1/3)) = 1.194214 */ + .long 0x3F99C9E5 /* HI((2^0*(1+23/32+1/64))^(1/3)) = 1.201474 */ + .long 0x3F9AB4F2 /* HI((2^0*(1+24/32+1/64))^(1/3)) = 1.208647 */ + .long 0x3F9B9D3D /* HI((2^0*(1+25/32+1/64))^(1/3)) = 1.215736 */ + .long 0x3F9C82DA /* HI((2^0*(1+26/32+1/64))^(1/3)) = 1.222743 */ + .long 0x3F9D65DD /* HI((2^0*(1+27/32+1/64))^(1/3)) = 1.229671 */ + .long 0x3F9E4659 /* HI((2^0*(1+28/32+1/64))^(1/3)) = 1.236522 */ + .long 0x3F9F245F /* HI((2^0*(1+29/32+1/64))^(1/3)) = 1.243297 */ + .long 0x3FA00000 /* HI((2^0*(1+30/32+1/64))^(1/3)) = 1.25 */ + .long 0x3FA0D94C /* HI((2^0*(1+31/32+1/64))^(1/3)) = 1.256631 */ + .long 0x3FA21B02 /* HI((2^1*(1+0/32+1/64))^(1/3)) = 1.266449 */ + .long 0x3FA3C059 /* HI((2^1*(1+1/32+1/64))^(1/3)) = 1.279307 */ + .long 0x3FA55D61 /* HI((2^1*(1+2/32+1/64))^(1/3)) = 1.291912 */ + .long 0x3FA6F282 /* HI((2^1*(1+3/32+1/64))^(1/3)) = 1.304276 */ + .long 0x3FA8801A /* HI((2^1*(1+4/32+1/64))^(1/3)) = 1.316409 */ + .long 0x3FAA067E /* HI((2^1*(1+5/32+1/64))^(1/3)) = 1.328323 */ + .long 0x3FAB8602 /* HI((2^1*(1+6/32+1/64))^(1/3)) = 1.340027 */ + .long 0x3FACFEEF /* HI((2^1*(1+7/32+1/64))^(1/3)) = 1.35153 */ + .long 0x3FAE718E /* HI((2^1*(1+8/32+1/64))^(1/3)) = 1.36284 */ + .long 0x3FAFDE1F /* HI((2^1*(1+9/32+1/64))^(1/3)) = 
1.373966 */ + .long 0x3FB144E1 /* HI((2^1*(1+10/32+1/64))^(1/3)) = 1.384915 */ + .long 0x3FB2A60D /* HI((2^1*(1+11/32+1/64))^(1/3)) = 1.395692 */ + .long 0x3FB401DA /* HI((2^1*(1+12/32+1/64))^(1/3)) = 1.406307 */ + .long 0x3FB5587B /* HI((2^1*(1+13/32+1/64))^(1/3)) = 1.416763 */ + .long 0x3FB6AA20 /* HI((2^1*(1+14/32+1/64))^(1/3)) = 1.427067 */ + .long 0x3FB7F6F7 /* HI((2^1*(1+15/32+1/64))^(1/3)) = 1.437224 */ + .long 0x3FB93F29 /* HI((2^1*(1+16/32+1/64))^(1/3)) = 1.44724 */ + .long 0x3FBA82E1 /* HI((2^1*(1+17/32+1/64))^(1/3)) = 1.457119 */ + .long 0x3FBBC244 /* HI((2^1*(1+18/32+1/64))^(1/3)) = 1.466866 */ + .long 0x3FBCFD77 /* HI((2^1*(1+19/32+1/64))^(1/3)) = 1.476485 */ + .long 0x3FBE349B /* HI((2^1*(1+20/32+1/64))^(1/3)) = 1.48598 */ + .long 0x3FBF67D3 /* HI((2^1*(1+21/32+1/64))^(1/3)) = 1.495356 */ + .long 0x3FC0973C /* HI((2^1*(1+22/32+1/64))^(1/3)) = 1.504615 */ + .long 0x3FC1C2F6 /* HI((2^1*(1+23/32+1/64))^(1/3)) = 1.513762 */ + .long 0x3FC2EB1A /* HI((2^1*(1+24/32+1/64))^(1/3)) = 1.5228 */ + .long 0x3FC40FC6 /* HI((2^1*(1+25/32+1/64))^(1/3)) = 1.531731 */ + .long 0x3FC53112 /* HI((2^1*(1+26/32+1/64))^(1/3)) = 1.54056 */ + .long 0x3FC64F16 /* HI((2^1*(1+27/32+1/64))^(1/3)) = 1.549289 */ + .long 0x3FC769EB /* HI((2^1*(1+28/32+1/64))^(1/3)) = 1.55792 */ + .long 0x3FC881A6 /* HI((2^1*(1+29/32+1/64))^(1/3)) = 1.566457 */ + .long 0x3FC9965D /* HI((2^1*(1+30/32+1/64))^(1/3)) = 1.574901 */ + .long 0x3FCAA825 /* HI((2^1*(1+31/32+1/64))^(1/3)) = 1.583256 */ + .long 0x3FCC3D79 /* HI((2^2*(1+0/32+1/64))^(1/3)) = 1.595626 */ + .long 0x3FCE5054 /* HI((2^2*(1+1/32+1/64))^(1/3)) = 1.611826 */ + .long 0x3FD058B8 /* HI((2^2*(1+2/32+1/64))^(1/3)) = 1.627707 */ + .long 0x3FD25726 /* HI((2^2*(1+3/32+1/64))^(1/3)) = 1.643285 */ + .long 0x3FD44C15 /* HI((2^2*(1+4/32+1/64))^(1/3)) = 1.658572 */ + .long 0x3FD637F2 /* HI((2^2*(1+5/32+1/64))^(1/3)) = 1.673582 */ + .long 0x3FD81B24 /* HI((2^2*(1+6/32+1/64))^(1/3)) = 1.688328 */ + .long 0x3FD9F60B /* HI((2^2*(1+7/32+1/64))^(1/3)) = 
1.702821 */ + .long 0x3FDBC8FE /* HI((2^2*(1+8/32+1/64))^(1/3)) = 1.717071 */ + .long 0x3FDD9452 /* HI((2^2*(1+9/32+1/64))^(1/3)) = 1.731089 */ + .long 0x3FDF5853 /* HI((2^2*(1+10/32+1/64))^(1/3)) = 1.744883 */ + .long 0x3FE1154B /* HI((2^2*(1+11/32+1/64))^(1/3)) = 1.758462 */ + .long 0x3FE2CB7F /* HI((2^2*(1+12/32+1/64))^(1/3)) = 1.771835 */ + .long 0x3FE47B2E /* HI((2^2*(1+13/32+1/64))^(1/3)) = 1.785009 */ + .long 0x3FE62496 /* HI((2^2*(1+14/32+1/64))^(1/3)) = 1.797992 */ + .long 0x3FE7C7F0 /* HI((2^2*(1+15/32+1/64))^(1/3)) = 1.810789 */ + .long 0x3FE96571 /* HI((2^2*(1+16/32+1/64))^(1/3)) = 1.823408 */ + .long 0x3FEAFD4C /* HI((2^2*(1+17/32+1/64))^(1/3)) = 1.835855 */ + .long 0x3FEC8FB3 /* HI((2^2*(1+18/32+1/64))^(1/3)) = 1.848135 */ + .long 0x3FEE1CD3 /* HI((2^2*(1+19/32+1/64))^(1/3)) = 1.860255 */ + .long 0x3FEFA4D7 /* HI((2^2*(1+20/32+1/64))^(1/3)) = 1.872218 */ + .long 0x3FF127E9 /* HI((2^2*(1+21/32+1/64))^(1/3)) = 1.88403 */ + .long 0x3FF2A62F /* HI((2^2*(1+22/32+1/64))^(1/3)) = 1.895697 */ + .long 0x3FF41FD0 /* HI((2^2*(1+23/32+1/64))^(1/3)) = 1.907221 */ + .long 0x3FF594EE /* HI((2^2*(1+24/32+1/64))^(1/3)) = 1.918607 */ + .long 0x3FF705AC /* HI((2^2*(1+25/32+1/64))^(1/3)) = 1.929861 */ + .long 0x3FF8722A /* HI((2^2*(1+26/32+1/64))^(1/3)) = 1.940984 */ + .long 0x3FF9DA86 /* HI((2^2*(1+27/32+1/64))^(1/3)) = 1.951981 */ + .long 0x3FFB3EDE /* HI((2^2*(1+28/32+1/64))^(1/3)) = 1.962856 */ + .long 0x3FFC9F4E /* HI((2^2*(1+29/32+1/64))^(1/3)) = 1.973612 */ + .long 0x3FFDFBF2 /* HI((2^2*(1+30/32+1/64))^(1/3)) = 1.984251 */ + .long 0x3FFF54E3 /* HI((2^2*(1+31/32+1/64))^(1/3)) = 1.994778 */ + .align 32 + .long 0xBDE3A962, 0xBDE3A962, 0xBDE3A962, 0xBDE3A962, 0xBDE3A962, 0xBDE3A962, 0xBDE3A962, 0xBDE3A962 /* _sP2 */ + .align 32 + .long 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91, 0x3EAAAC91 /* _sP1 */ + .align 32 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff /* 
_sMantissaMask (EXP_MSK3) */ + .align 32 + .long 0x007e0000, 0x007e0000, 0x007e0000, 0x007e0000, 0x007e0000, 0x007e0000, 0x007e0000, 0x007e0000 /* _sMantissaMask1 (SIG_MASK) */ + .align 32 + .long 0xBF800000, 0xBF800000, 0xBF800000, 0xBF800000, 0xBF800000, 0xBF800000, 0xBF800000, 0xBF800000 /* _sExpMask (EXP_MASK) */ + .align 32 + .long 0xBF820000, 0xBF820000, 0xBF820000, 0xBF820000, 0xBF820000, 0xBF820000, 0xBF820000, 0xBF820000 /* _sExpMask1 (EXP_MASK2) */ + .align 32 + .long 0x0000007c, 0x0000007c, 0x0000007c, 0x0000007c, 0x0000007c, 0x0000007c, 0x0000007c, 0x0000007c /* _iRcpIndexMask */ + .align 32 + .long 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff /* _iBExpMask */ + .align 32 + .long 0x00000100, 0x00000100, 0x00000100, 0x00000100, 0x00000100, 0x00000100, 0x00000100, 0x00000100 /* _iSignMask */ + .align 32 + .long 0x00000055, 0x00000055, 0x00000055, 0x00000055, 0x00000055, 0x00000055, 0x00000055, 0x00000055 /* _iBias */ + .align 32 + .long 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001, 0x00000001 /* _iOne */ + .align 32 + .long 0x00000555, 0x00000555, 0x00000555, 0x00000555, 0x00000555, 0x00000555, 0x00000555, 0x00000555 /* _i555 */ + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _iAbsMask */ + .align 32 + .long 0x80800000, 0x80800000, 0x80800000, 0x80800000, 0x80800000, 0x80800000, 0x80800000, 0x80800000 /* _iSubConst */ + .align 32 + .long 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF, 0xFEFFFFFF /* _iCmpConst */ + .align 32 + .type __svml_scbrt_data_internal,@object + .size __svml_scbrt_data_internal,.-__svml_scbrt_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_cbrt2_core.S b/sysdeps/x86_64/fpu/svml_d_cbrt2_core.S new file mode 100644 index 0000000000..4bf546564b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cbrt2_core.S @@ -0,0 +1,29 @@ +/* Function cbrt 
vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_cbrt) +WRAPPER_IMPL_SSE2 cbrt +END (_ZGVbN2v_cbrt) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_cbrt) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_cbrt4_core.S b/sysdeps/x86_64/fpu/svml_d_cbrt4_core.S new file mode 100644 index 0000000000..e6d1003e27 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cbrt4_core.S @@ -0,0 +1,29 @@ +/* Function cbrt vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_cbrt) +WRAPPER_IMPL_AVX _ZGVbN2v_cbrt +END (_ZGVdN4v_cbrt) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_cbrt) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_cbrt4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_cbrt4_core_avx.S new file mode 100644 index 0000000000..70632869ac --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cbrt4_core_avx.S @@ -0,0 +1,25 @@ +/* Function cbrt vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_cbrt) +WRAPPER_IMPL_AVX _ZGVbN2v_cbrt +END (_ZGVcN4v_cbrt) diff --git a/sysdeps/x86_64/fpu/svml_d_cbrt8_core.S b/sysdeps/x86_64/fpu/svml_d_cbrt8_core.S new file mode 100644 index 0000000000..37571673a7 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_cbrt8_core.S @@ -0,0 +1,25 @@ +/* Function cbrt vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_cbrt) +WRAPPER_IMPL_AVX512 _ZGVdN4v_cbrt +END (_ZGVeN8v_cbrt) diff --git a/sysdeps/x86_64/fpu/svml_s_cbrtf16_core.S b/sysdeps/x86_64/fpu/svml_s_cbrtf16_core.S new file mode 100644 index 0000000000..1be6294026 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_cbrtf16_core.S @@ -0,0 +1,25 @@ +/* Function cbrtf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_cbrtf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_cbrtf +END (_ZGVeN16v_cbrtf) diff --git a/sysdeps/x86_64/fpu/svml_s_cbrtf4_core.S b/sysdeps/x86_64/fpu/svml_s_cbrtf4_core.S new file mode 100644 index 0000000000..2469a100f4 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_cbrtf4_core.S @@ -0,0 +1,29 @@ +/* Function cbrtf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_cbrtf) +WRAPPER_IMPL_SSE2 cbrtf +END (_ZGVbN4v_cbrtf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_cbrtf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_cbrtf8_core.S b/sysdeps/x86_64/fpu/svml_s_cbrtf8_core.S new file mode 100644 index 0000000000..efedc22323 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_cbrtf8_core.S @@ -0,0 +1,29 @@ +/* Function cbrtf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_cbrtf) +WRAPPER_IMPL_AVX _ZGVbN4v_cbrtf +END (_ZGVdN8v_cbrtf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_cbrtf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_cbrtf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_cbrtf8_core_avx.S new file mode 100644 index 0000000000..b5acc62426 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_cbrtf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function cbrtf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_cbrtf) +WRAPPER_IMPL_AVX _ZGVbN4v_cbrtf +END (_ZGVcN8v_cbrtf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx.c new file mode 100644 index 0000000000..c8bc643c99 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-cbrt.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx2.c new file mode 100644 index 0000000000..c8bc643c99 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-cbrt.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx512f.c new file mode 100644 index 0000000000..c8bc643c99 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-cbrt.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-cbrt.c b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt.c new file mode 100644 index 0000000000..fb3684b18c --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-cbrt.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC cbrt +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index db136cc901..b1981ac7e4 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVbN2v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVbN2v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVbN2v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVbN2v_sinh) +VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVbN2v_cbrt) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c 
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 5fc09ac8c0..47915a7e59 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVdN4v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVdN4v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVdN4v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVdN4v_sinh) +VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVdN4v_cbrt) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 26ef7fb365..5cd5049807 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVcN4v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVcN4v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVcN4v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVcN4v_sinh) +VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVcN4v_cbrt) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index c7055fca76..83970739ab 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10), _ZGVeN8v_exp10) VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVeN8v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVeN8v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVeN8v_sinh) +VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVeN8v_cbrt) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx.c new file mode 100644 index 0000000000..59b8d77f71 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-cbrtf.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx2.c new file mode 100644 index 0000000000..59b8d77f71 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-cbrtf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx512f.c new file mode 100644 index 0000000000..59b8d77f71 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-cbrtf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf.c b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf.c new file mode 100644 index 0000000000..3a06ba79e0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-cbrtf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC cbrtf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index d353bcb0f2..0420f11c28 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVeN16v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVeN16v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVeN16v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVeN16v_sinhf) +VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVeN16v_cbrtf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 5e59117626..c8f7580265 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVbN4v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVbN4v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVbN4v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVbN4v_sinhf) +VECTOR_WRAPPER 
(WRAPPER_NAME (cbrtf), _ZGVbN4v_cbrtf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index e884a5f4df..b581796b88 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVdN8v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVdN8v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVdN8v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVdN8v_sinhf) +VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVdN8v_cbrtf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 95910d39e9..f16789e5ff 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -36,6 +36,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (exp10f), _ZGVcN8v_exp10f) VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVcN8v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVcN8v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVcN8v_sinhf) +VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVcN8v_cbrtf) #define VEC_INT_TYPE __m128i From patchwork Tue Dec 28 20:11:22 2021 X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573820 To: libc-alpha@sourceware.org Subject: [PATCH v4 10/18] x86-64: Add vector atan2/atan2f implementation to libmvec Date: Tue, 28 Dec 2021 12:11:22 -0800 Message-Id: <20211228201130.737370-11-skpgkp2@gmail.com> In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> From: Sunil Pandey Reply-To: Sunil K Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com Implement vectorized atan2/atan2f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector atan2/atan2f with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_atan22_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_atan22_core.c | 28 ++ .../fpu/multiarch/svml_d_atan22_core_sse4.S | 471 +++++++++++++++++ .../fpu/multiarch/svml_d_atan24_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_atan24_core.c | 28 ++ .../fpu/multiarch/svml_d_atan24_core_avx2.S | 451 +++++++++++++++++ .../fpu/multiarch/svml_d_atan28_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_atan28_core.c | 28 ++ .../fpu/multiarch/svml_d_atan28_core_avx512.S | 475 ++++++++++++++++++ .../fpu/multiarch/svml_s_atan2f16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_atan2f16_core.c | 28 ++ .../multiarch/svml_s_atan2f16_core_avx512.S | 399 +++++++++++++++ .../fpu/multiarch/svml_s_atan2f4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_atan2f4_core.c | 28 ++ .../fpu/multiarch/svml_s_atan2f4_core_sse4.S | 384 ++++++++++++++ .../fpu/multiarch/svml_s_atan2f8_core-sse.S | 20 + .../fpu/multiarch/svml_s_atan2f8_core.c | 28 ++ .../fpu/multiarch/svml_s_atan2f8_core_avx2.S | 362 +++++++++++++ sysdeps/x86_64/fpu/svml_d_atan22_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_atan24_core.S | 29 ++ sysdeps/x86_64/fpu/svml_d_atan24_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_atan28_core.S | 25 + sysdeps/x86_64/fpu/svml_s_atan2f16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_atan2f4_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_atan2f8_core.S | 29 ++ sysdeps/x86_64/fpu/svml_s_atan2f8_core_avx.S | 25 + .../fpu/test-double-libmvec-atan2-avx.c | 1 + .../fpu/test-double-libmvec-atan2-avx2.c | 1 + .../fpu/test-double-libmvec-atan2-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-atan2.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + 
.../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-atan2f-avx.c | 1 + .../fpu/test-float-libmvec-atan2f-avx2.c | 1 + .../fpu/test-float-libmvec-atan2f-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-atan2f.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 3117 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan22_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_atan24_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan24_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atan28_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atan2f16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atan2f4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atan2f8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atan2f8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atan2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atan2f.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 7f1304ed1d..31878bf4ed 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -208,4 +208,15 @@ #define __DECL_SIMD_cbrtf32x #define __DECL_SIMD_cbrtf64x #define __DECL_SIMD_cbrtf128x + +#define __DECL_SIMD_atan2 +#define __DECL_SIMD_atan2f +#define __DECL_SIMD_atan2l +#define __DECL_SIMD_atan2f16 +#define __DECL_SIMD_atan2f32 +#define __DECL_SIMD_atan2f64 +#define __DECL_SIMD_atan2f128 +#define __DECL_SIMD_atan2f32x +#define __DECL_SIMD_atan2f64x +#define __DECL_SIMD_atan2f128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 26d18f0135..1bd4911993 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -56,7 +56,7 @@ __MATHCALL_VEC (asin,, (_Mdouble_ __x)); /* Arc tangent of X. */ __MATHCALL_VEC (atan,, (_Mdouble_ __x)); /* Arc tangent of Y/X. */ -__MATHCALL (atan2,, (_Mdouble_ __y, _Mdouble_ __x)); +__MATHCALL_VEC (atan2,, (_Mdouble_ __y, _Mdouble_ __x)); /* Cosine of X. 
*/ __MATHCALL_VEC (cos,, (_Mdouble_ __x)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index a6558d9810..2b3b8d3886 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -55,6 +55,7 @@ GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2v_expm1 F GLIBC_2.35 _ZGVbN2v_sinh F +GLIBC_2.35 _ZGVbN2vv_atan2 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F @@ -65,6 +66,7 @@ GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4v_expm1f F GLIBC_2.35 _ZGVbN4v_sinhf F +GLIBC_2.35 _ZGVbN4vv_atan2f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F @@ -75,6 +77,7 @@ GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4v_expm1 F GLIBC_2.35 _ZGVcN4v_sinh F +GLIBC_2.35 _ZGVcN4vv_atan2 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F @@ -85,6 +88,7 @@ GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8v_expm1f F GLIBC_2.35 _ZGVcN8v_sinhf F +GLIBC_2.35 _ZGVcN8vv_atan2f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F @@ -95,6 +99,7 @@ GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4v_expm1 F GLIBC_2.35 _ZGVdN4v_sinh F +GLIBC_2.35 _ZGVdN4vv_atan2 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F @@ -105,6 +110,7 @@ GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8v_expm1f F GLIBC_2.35 _ZGVdN8v_sinhf F +GLIBC_2.35 _ZGVdN8vv_atan2f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F @@ -115,6 +121,7 @@ GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16v_expm1f F GLIBC_2.35 _ZGVeN16v_sinhf F +GLIBC_2.35 _ZGVeN16vv_atan2f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F 
GLIBC_2.35 _ZGVeN8v_asin F @@ -125,4 +132,5 @@ GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8v_expm1 F GLIBC_2.35 _ZGVeN8v_sinh F +GLIBC_2.35 _ZGVeN8vv_atan2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index dcd45934ab..62f2890ab3 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -98,6 +98,10 @@ # define __DECL_SIMD_cbrt __DECL_SIMD_x86_64 # undef __DECL_SIMD_cbrtf # define __DECL_SIMD_cbrtf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_atan2 +# define __DECL_SIMD_atan2 __DECL_SIMD_x86_64 +# undef __DECL_SIMD_atan2f +# define __DECL_SIMD_atan2f __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index dfb5f13ea3..2269b74d50 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -48,6 +48,8 @@ !GCC$ builtin (sinhf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cbrt) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cbrtf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (atan2) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (atan2f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -81,3 +83,5 @@ !GCC$ builtin (sinhf) attributes simd (notinbranch) if('x32') !GCC$ builtin (cbrt) attributes simd (notinbranch) if('x32') !GCC$ builtin (cbrtf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (atan2) attributes simd (notinbranch) if('x32') +!GCC$ builtin (atan2f) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index dde737c0d6..96a40856fa 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -25,6 +25,7 @@ libmvec-funcs = \ acos \ asin \ 
atan \ + atan2 \ cbrt \ cos \ cosh \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index b70aeb3e2f..f58c98eb45 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -23,6 +23,7 @@ libmvec { _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; _ZGVbN2v_sinh; _ZGVcN4v_sinh; _ZGVdN4v_sinh; _ZGVeN8v_sinh; + _ZGVbN2vv_atan2; _ZGVcN4vv_atan2; _ZGVdN4vv_atan2; _ZGVeN8vv_atan2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; @@ -33,6 +34,7 @@ libmvec { _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; _ZGVbN4v_sinhf; _ZGVcN8v_sinhf; _ZGVdN8v_sinhf; _ZGVeN16v_sinhf; + _ZGVbN4vv_atan2f; _ZGVcN8vv_atan2f; _ZGVdN8vv_atan2f; _ZGVeN16vv_atan2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index e039a993df..6f59c61756 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -166,6 +166,26 @@ float: 2 float128: 2 ldouble: 1 +Function: "atan2_vlen16": +float: 2 + +Function: "atan2_vlen2": +double: 1 + +Function: "atan2_vlen4": +double: 1 +float: 2 + +Function: "atan2_vlen4_avx2": +double: 1 + +Function: "atan2_vlen8": +double: 1 +float: 2 + +Function: "atan2_vlen8_avx2": +float: 2 + Function: "atan_downward": double: 1 float: 2 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core-sse2.S new file mode 100644 index 0000000000..6c3ad05a6c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized atan2. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2vv_atan2 _ZGVbN2vv_atan2_sse2 +#include "../svml_d_atan22_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core.c new file mode 100644 index 0000000000..43f1ee7f33 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atan2, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVbN2vv_atan2 +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2vv_atan2, __GI__ZGVbN2vv_atan2, + __redirect__ZGVbN2vv_atan2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core_sse4.S new file mode 100644 index 0000000000..5c0d0fd17f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan22_core_sse4.S @@ -0,0 +1,471 @@ +/* Function atan2 vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. 
+ * + * + */ + +/* Offsets for data table __svml_datan2_data_internal + */ +#define dPI 0 +#define dPIO2 16 +#define dA19 32 +#define dA18 48 +#define dA17 64 +#define dA16 80 +#define dA15 96 +#define dA14 112 +#define dA13 128 +#define dA12 144 +#define dA11 160 +#define dA10 176 +#define dA09 192 +#define dA08 208 +#define dA07 224 +#define dA06 240 +#define dA05 256 +#define dA04 272 +#define dA03 288 +#define dA02 304 +#define dA01 320 +#define dA00 336 +#define dSIGN_MASK 352 +#define iCHK_WORK_SUB 368 +#define iCHK_WORK_CMP 384 +#define dABS_MASK 400 +#define dZERO 416 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2vv_atan2_sse4) + subq $88, %rsp + cfi_def_cfa_offset(96) + movaps %xmm0, %xmm8 + +/* + * #define NO_VECTOR_ZERO_ATAN2_ARGS + * Declarations + * Variables + * Constants + * The end of declarations + * Implementation + * Get r0~=1/B + * Cannot be replaced by VQRCP(D, dR0, dB); + * Argument Absolute values + */ + movups dABS_MASK+__svml_datan2_data_internal(%rip), %xmm4 + movaps %xmm1, %xmm9 + movaps %xmm4, %xmm1 + andps %xmm8, %xmm4 + andps %xmm9, %xmm1 + movaps %xmm4, %xmm2 + cmpnltpd %xmm1, %xmm2 + +/* Argument signs */ + movups dSIGN_MASK+__svml_datan2_data_internal(%rip), %xmm3 + movaps %xmm2, %xmm0 + movups dPIO2+__svml_datan2_data_internal(%rip), %xmm5 + movaps %xmm3, %xmm7 + movaps %xmm3, %xmm6 + +/* + * 1) If y<x then a=y, b=x, PIO2=0 + * 2) If y>x then a=-x, b=y, PIO2=Pi/2 + */ + orps %xmm1, %xmm3 + movaps %xmm2, %xmm10 + andps %xmm2, %xmm5 + andnps %xmm4, %xmm0 + andps %xmm2, %xmm3 + andnps %xmm1, %xmm10 + andps %xmm4, %xmm2 + orps %xmm3, %xmm0 + orps %xmm2, %xmm10 + divpd %xmm10, %xmm0 + movq iCHK_WORK_SUB+__svml_datan2_data_internal(%rip), %xmm11 + +/* if x<0, dPI = Pi, else dPI =0 */ + movaps %xmm9, %xmm3 + +/* Check if y and x are on main path.
*/ + pshufd $221, %xmm1, %xmm12 + andps %xmm9, %xmm7 + psubd %xmm11, %xmm12 + andps %xmm8, %xmm6 + movq iCHK_WORK_CMP+__svml_datan2_data_internal(%rip), %xmm13 + xorl %edx, %edx + movups %xmm4, 16(%rsp) + xorl %eax, %eax + pshufd $221, %xmm4, %xmm14 + movdqa %xmm12, %xmm4 + pcmpgtd %xmm13, %xmm4 + pcmpeqd %xmm13, %xmm12 + por %xmm12, %xmm4 + +/* Polynomial. */ + movaps %xmm0, %xmm12 + mulpd %xmm0, %xmm12 + cmplepd dZERO+__svml_datan2_data_internal(%rip), %xmm3 + psubd %xmm11, %xmm14 + movdqa %xmm14, %xmm15 + pcmpeqd %xmm13, %xmm14 + pcmpgtd %xmm13, %xmm15 + por %xmm14, %xmm15 + movaps %xmm12, %xmm14 + mulpd %xmm12, %xmm14 + por %xmm15, %xmm4 + movaps %xmm14, %xmm15 + mulpd %xmm14, %xmm15 + movmskps %xmm4, %ecx + movups %xmm10, (%rsp) + movups dA19+__svml_datan2_data_internal(%rip), %xmm10 + mulpd %xmm15, %xmm10 + movups dA18+__svml_datan2_data_internal(%rip), %xmm13 + movups dA17+__svml_datan2_data_internal(%rip), %xmm11 + addpd dA15+__svml_datan2_data_internal(%rip), %xmm10 + mulpd %xmm15, %xmm13 + mulpd %xmm15, %xmm11 + mulpd %xmm15, %xmm10 + addpd dA14+__svml_datan2_data_internal(%rip), %xmm13 + addpd dA13+__svml_datan2_data_internal(%rip), %xmm11 + addpd dA11+__svml_datan2_data_internal(%rip), %xmm10 + mulpd %xmm15, %xmm13 + mulpd %xmm15, %xmm11 + mulpd %xmm15, %xmm10 + addpd dA10+__svml_datan2_data_internal(%rip), %xmm13 + addpd dA09+__svml_datan2_data_internal(%rip), %xmm11 + addpd dA07+__svml_datan2_data_internal(%rip), %xmm10 + mulpd %xmm15, %xmm13 + mulpd %xmm15, %xmm11 + mulpd %xmm15, %xmm10 + addpd dA06+__svml_datan2_data_internal(%rip), %xmm13 + addpd dA05+__svml_datan2_data_internal(%rip), %xmm11 + addpd dA03+__svml_datan2_data_internal(%rip), %xmm10 + mulpd %xmm15, %xmm13 + mulpd %xmm15, %xmm11 + mulpd %xmm12, %xmm10 + addpd dA02+__svml_datan2_data_internal(%rip), %xmm13 + addpd dA01+__svml_datan2_data_internal(%rip), %xmm11 + addpd %xmm10, %xmm13 + mulpd %xmm11, %xmm12 + mulpd %xmm13, %xmm14 + movups dA16+__svml_datan2_data_internal(%rip), %xmm2 + 
 mulpd %xmm15, %xmm2 + addpd dA12+__svml_datan2_data_internal(%rip), %xmm2 + mulpd %xmm15, %xmm2 + addpd dA08+__svml_datan2_data_internal(%rip), %xmm2 + mulpd %xmm15, %xmm2 + addpd dA04+__svml_datan2_data_internal(%rip), %xmm2 + +/* A00=1.0, account for it later VQFMA(D, dP4, dP4, dR8, dA00); */ + mulpd %xmm2, %xmm15 + addpd %xmm12, %xmm15 + addpd %xmm14, %xmm15 + +/* + * Reconstruction. + * dP=(R+R*dP) + dPIO2 + */ + mulpd %xmm0, %xmm15 + addpd %xmm15, %xmm0 + addpd %xmm5, %xmm0 + andps __svml_datan2_data_internal(%rip), %xmm3 + orps %xmm7, %xmm0 + addpd %xmm3, %xmm0 + +/* Special branch for fast (vector) processing of zero arguments */ + movups 16(%rsp), %xmm11 + orps %xmm6, %xmm0 + testb $3, %cl + +/* Go to auxiliary branch */ + jne L(AUX_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 xmm1 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm11 + +/* Return from auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH_RETURN): +/* + * Special branch for fast (vector) processing of zero arguments + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 xmm8 xmm9 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $88, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(96) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm8, 32(%rsp) + movups %xmm9, 48(%rsp) + movups %xmm0, 64(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 + + movq %r12, 16(%rsp) + cfi_offset(12, -80) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -88) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -96) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl
$2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 64(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -80) + cfi_offset(13, -88) + cfi_offset(14, -96) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + movsd 48(%rsp,%r14,8), %xmm1 + call atan2@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + cfi_restore(12) + cfi_restore(13) + cfi_restore(14) + # LOE rbx rbp r15 r12d r13d + +/* Auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH): +/* Check if at least one of X or Y is zero: iAXAYZERO */ + movups dZERO+__svml_datan2_data_internal(%rip), %xmm2 + +/* Check if both X & Y are not NaNs: iXYnotNAN */ + movaps %xmm9, %xmm12 + movaps %xmm8, %xmm10 + cmpordpd %xmm9, %xmm12 + cmpordpd %xmm8, %xmm10 + cmpeqpd %xmm2, %xmm1 + cmpeqpd %xmm2, %xmm11 + andps %xmm10, %xmm12 + orps %xmm11, %xmm1 + pshufd $221, %xmm1, %xmm1 + pshufd $221, %xmm12, %xmm11 + +/* Check if at least one of X or Y is zero and not NaN: iAXAYZEROnotNAN */ + pand %xmm11, %xmm1 + +/* Exclude from previous callout mask zero (and not NaN) arguments */ + movdqa %xmm1, %xmm13 + pandn %xmm4, %xmm13 + +/* + * Path for zero arguments (at least one of both) + * Check if both args are zeros (den. is zero) + */ + movups (%rsp), %xmm4 + cmpeqpd %xmm2, %xmm4 + +/* Go to callout */ + movmskps %xmm13, %edx + +/* Set sPIO2 to zero if den.
is zero */ + movaps %xmm4, %xmm15 + andps %xmm2, %xmm4 + andnps %xmm5, %xmm15 + andl $3, %edx + orps %xmm4, %xmm15 + pshufd $221, %xmm9, %xmm5 + orps %xmm7, %xmm15 + +/* Res = sign(Y)*(X<0)?(PIO2+PI):PIO2 */ + pshufd $221, %xmm2, %xmm7 + pcmpgtd %xmm5, %xmm7 + pshufd $80, %xmm7, %xmm14 + andps %xmm3, %xmm14 + addpd %xmm14, %xmm15 + +/* Merge results from main and spec path */ + pshufd $80, %xmm1, %xmm3 + orps %xmm6, %xmm15 + movdqa %xmm3, %xmm6 + andps %xmm3, %xmm15 + andnps %xmm0, %xmm6 + movaps %xmm6, %xmm0 + orps %xmm15, %xmm0 + +/* Return to main vector processing path */ + jmp L(AUX_BRANCH_RETURN) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 xmm8 xmm9 +END(_ZGVbN2vv_atan2_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_datan2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 dPI[2][2]; + __declspec(align(16)) VUINT32 dPIO2[2][2]; + __declspec(align(16)) VUINT32 dA19[2][2]; + __declspec(align(16)) VUINT32 dA18[2][2]; + __declspec(align(16)) VUINT32 dA17[2][2]; + __declspec(align(16)) VUINT32 dA16[2][2]; + __declspec(align(16)) VUINT32 dA15[2][2]; + __declspec(align(16)) VUINT32 dA14[2][2]; + __declspec(align(16)) VUINT32 dA13[2][2]; + __declspec(align(16)) VUINT32 dA12[2][2]; + __declspec(align(16)) VUINT32 dA11[2][2]; + __declspec(align(16)) VUINT32 dA10[2][2]; + __declspec(align(16)) VUINT32 dA09[2][2]; + __declspec(align(16)) VUINT32 dA08[2][2]; + __declspec(align(16)) VUINT32 dA07[2][2]; + __declspec(align(16)) VUINT32 dA06[2][2]; + __declspec(align(16)) VUINT32 dA05[2][2]; + __declspec(align(16)) VUINT32 dA04[2][2]; + __declspec(align(16)) VUINT32 dA03[2][2]; + __declspec(align(16)) VUINT32 dA02[2][2]; + __declspec(align(16)) VUINT32 dA01[2][2]; + __declspec(align(16)) VUINT32 dA00[2][2]; + __declspec(align(16)) VUINT32 dSIGN_MASK[2][2]; + __declspec(align(16)) VUINT32 iCHK_WORK_SUB[4][1]; + __declspec(align(16)) VUINT32 iCHK_WORK_CMP[4][1]; + __declspec(align(16)) VUINT32 dABS_MASK[2][2]; + 
__declspec(align(16)) VUINT32 dZERO[2][2]; +} __svml_datan2_data_internal; +#endif +__svml_datan2_data_internal: + .quad 0x400921FB54442D18, 0x400921FB54442D18 //dPI + .align 16 + .quad 0x3FF921FB54442D18, 0x3FF921FB54442D18 //dPIO2 + .align 16 + .quad 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3 // dA19 + .align 16 + .quad 0x3F2CED0A36665209, 0x3F2CED0A36665209 // dA18 + .align 16 + .quad 0xBF52E67C93954C23, 0xBF52E67C93954C23 // dA17 + .align 16 + .quad 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3 // dA16 + .align 16 + .quad 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD // dA15 + .align 16 + .quad 0x3F914F4C661116A5, 0x3F914F4C661116A5 // dA14 + .align 16 + .quad 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C // dA13 + .align 16 + .quad 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F // dA12 + .align 16 + .quad 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC // dA11 + .align 16 + .quad 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B // dA10 + .align 16 + .quad 0xBFAAD261EAA09954, 0xBFAAD261EAA09954 // dA09 + .align 16 + .quad 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF // dA08 + .align 16 + .quad 0xBFB11084009435E0, 0xBFB11084009435E0 // dA07 + .align 16 + .quad 0x3FB3B12A49295651, 0x3FB3B12A49295651 // dA06 + .align 16 + .quad 0xBFB745D009BADA94, 0xBFB745D009BADA94 // dA05 + .align 16 + .quad 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5 // dA04 + .align 16 + .quad 0xBFC2492491EE55C7, 0xBFC2492491EE55C7 // dA03 + .align 16 + .quad 0x3FC999999997EE34, 0x3FC999999997EE34 // dA02 + .align 16 + .quad 0xBFD55555555553C5, 0xBFD55555555553C5 // dA01 + .align 16 + .quad 0x3FF0000000000000, 0x3FF0000000000000 // dA00 + .align 16 + .quad 0x8000000000000000, 0x8000000000000000 //dSIGN_MASK + .align 16 + .long 0x80300000, 0x80300000, 0x80300000, 0x80300000 //iCHK_WORK_SUB + .align 16 + .long 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000 //iCHK_WORK_CMP + .align 16 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff //dABS_MASK + .align 16 + .quad 0x0000000000000000, 0x0000000000000000 //dZERO + .align 16 + .type 
__svml_datan2_data_internal,@object + .size __svml_datan2_data_internal,.-__svml_datan2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core-sse.S new file mode 100644 index 0000000000..0db843a088 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized atan2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVdN4vv_atan2 _ZGVdN4vv_atan2_sse_wrapper +#include "../svml_d_atan24_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core.c new file mode 100644 index 0000000000..c2e2611584 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atan2, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVdN4vv_atan2 +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4vv_atan2, __GI__ZGVdN4vv_atan2, + __redirect__ZGVdN4vv_atan2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core_avx2.S new file mode 100644 index 0000000000..cdf780715b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan24_core_avx2.S @@ -0,0 +1,451 @@ +/* Function atan2 vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +/* + * ALGORITHM DESCRIPTION: + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. + * + * + */ + +/* Offsets for data table __svml_datan2_data_internal + */ +#define dPI 0 +#define dPIO2 32 +#define dA19 64 +#define dA18 96 +#define dA17 128 +#define dA16 160 +#define dA15 192 +#define dA14 224 +#define dA13 256 +#define dA12 288 +#define dA11 320 +#define dA10 352 +#define dA09 384 +#define dA08 416 +#define dA07 448 +#define dA06 480 +#define dA05 512 +#define dA04 544 +#define dA03 576 +#define dA02 608 +#define dA01 640 +#define dA00 672 +#define dSIGN_MASK 704 +#define iCHK_WORK_SUB 736 +#define iCHK_WORK_CMP 768 +#define dABS_MASK 800 +#define dZERO 832 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4vv_atan2_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $128, %rsp + xorl %edx, %edx + +/* + * #define NO_VECTOR_ZERO_ATAN2_ARGS + * Declarations + * Variables + * Constants + * The end of declarations + * Implementation + * Get r0~=1/B + * Cannot be replaced by VQRCP(D, dR0, dB); + * Argument Absolute values + */ + vmovupd dABS_MASK+__svml_datan2_data_internal(%rip), %ymm5 + +/* Argument signs */ + vmovupd dSIGN_MASK+__svml_datan2_data_internal(%rip), %ymm4 + vmovups iCHK_WORK_SUB+__svml_datan2_data_internal(%rip), %xmm13 + vmovupd %ymm0, (%rsp) + vmovapd %ymm1, %ymm8 + vandpd %ymm5, %ymm8, %ymm2 + vandpd %ymm5, %ymm0, %ymm1 + vcmpnlt_uqpd %ymm2, %ymm1, %ymm15 + +/* + * 1) If y<x then a=y, b=x, PIO2=0 + * 2) If y>x then a=-x,
b=y, PIO2=Pi/2 + */ + vorpd %ymm4, %ymm2, %ymm6 + vblendvpd %ymm15, %ymm6, %ymm1, %ymm3 + vblendvpd %ymm15, %ymm1, %ymm2, %ymm6 + vdivpd %ymm6, %ymm3, %ymm14 + vmovups iCHK_WORK_CMP+__svml_datan2_data_internal(%rip), %xmm3 + vmovupd %ymm6, 32(%rsp) + vandpd %ymm4, %ymm0, %ymm7 + vandpd %ymm4, %ymm8, %ymm5 + vandpd dPIO2+__svml_datan2_data_internal(%rip), %ymm15, %ymm4 + +/* Check if y and x are on main path. */ + vextractf128 $1, %ymm2, %xmm9 + vextractf128 $1, %ymm1, %xmm10 + vshufps $221, %xmm9, %xmm2, %xmm11 + vshufps $221, %xmm10, %xmm1, %xmm12 + vpsubd %xmm13, %xmm11, %xmm0 + vpsubd %xmm13, %xmm12, %xmm9 + vpcmpgtd %xmm3, %xmm0, %xmm15 + vpcmpeqd %xmm3, %xmm0, %xmm6 + vpcmpgtd %xmm3, %xmm9, %xmm10 + vpcmpeqd %xmm3, %xmm9, %xmm3 + vpor %xmm6, %xmm15, %xmm11 + vpor %xmm3, %xmm10, %xmm12 + +/* Polynomial. */ + vmulpd %ymm14, %ymm14, %ymm10 + vpor %xmm12, %xmm11, %xmm3 + vmovupd dA18+__svml_datan2_data_internal(%rip), %ymm9 + vmovupd dA17+__svml_datan2_data_internal(%rip), %ymm12 + vmovupd dA16+__svml_datan2_data_internal(%rip), %ymm15 + vmulpd %ymm10, %ymm10, %ymm11 + +/* if x<0, dPI = Pi, else dPI =0 */ + vcmple_oqpd dZERO+__svml_datan2_data_internal(%rip), %ymm8, %ymm13 + vmovmskps %xmm3, %eax + vmulpd %ymm11, %ymm11, %ymm0 + vandpd __svml_datan2_data_internal(%rip), %ymm13, %ymm6 + vmovupd dA19+__svml_datan2_data_internal(%rip), %ymm13 + vfmadd213pd dA14+__svml_datan2_data_internal(%rip), %ymm0, %ymm9 + vfmadd213pd dA13+__svml_datan2_data_internal(%rip), %ymm0, %ymm12 + vfmadd213pd dA12+__svml_datan2_data_internal(%rip), %ymm0, %ymm15 + vfmadd213pd dA15+__svml_datan2_data_internal(%rip), %ymm0, %ymm13 + vfmadd213pd dA10+__svml_datan2_data_internal(%rip), %ymm0, %ymm9 + vfmadd213pd dA09+__svml_datan2_data_internal(%rip), %ymm0, %ymm12 + vfmadd213pd dA08+__svml_datan2_data_internal(%rip), %ymm0, %ymm15 + vfmadd213pd dA11+__svml_datan2_data_internal(%rip), %ymm0, %ymm13 + vfmadd213pd dA06+__svml_datan2_data_internal(%rip), %ymm0, %ymm9 + vfmadd213pd 
dA05+__svml_datan2_data_internal(%rip), %ymm0, %ymm12 + vfmadd213pd dA04+__svml_datan2_data_internal(%rip), %ymm0, %ymm15 + vfmadd213pd dA07+__svml_datan2_data_internal(%rip), %ymm0, %ymm13 + vfmadd213pd dA02+__svml_datan2_data_internal(%rip), %ymm0, %ymm9 + vfmadd213pd dA01+__svml_datan2_data_internal(%rip), %ymm0, %ymm12 + vfmadd213pd dA03+__svml_datan2_data_internal(%rip), %ymm0, %ymm13 + +/* A00=1.0, account for it later VQFMA(D, dP4, dP4, dR8, dA00); */ + vmulpd %ymm15, %ymm0, %ymm0 + vfmadd213pd %ymm9, %ymm10, %ymm13 + vfmadd213pd %ymm0, %ymm10, %ymm12 + vfmadd213pd %ymm12, %ymm11, %ymm13 + +/* + * Reconstruction. + * dP=(R+R*dP) + dPIO2 + */ + vfmadd213pd %ymm14, %ymm14, %ymm13 + vaddpd %ymm13, %ymm4, %ymm14 + vorpd %ymm5, %ymm14, %ymm0 + vaddpd %ymm0, %ymm6, %ymm9 + vorpd %ymm7, %ymm9, %ymm0 + +/* Special branch for fast (vector) processing of zero arguments */ + testl %eax, %eax + +/* Go to auxiliary branch */ + jne L(AUX_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm3 ymm0 ymm1 ymm2 ymm4 ymm5 ymm6 ymm7 ymm8 + +/* Return from auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH_RETURN): +/* + * Special branch for fast (vector) processing of zero arguments + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm8 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd (%rsp), %ymm1 + vmovupd %ymm8, 64(%rsp) + vmovupd %ymm0, 96(%rsp) + vmovupd %ymm1, 32(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */ + 
.cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 96(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* 
Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + movsd 64(%rsp,%r14,8), %xmm1 + call atan2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 96(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + cfi_restore(12) + cfi_restore(13) + cfi_restore(14) + # LOE rbx r15 r12d r13d + +/* Auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH): + vmovupd (%rsp), %ymm11 + +/* Check if at least one of X or Y is zero: iAXAYZERO */ + vmovupd dZERO+__svml_datan2_data_internal(%rip), %ymm10 + +/* Check if both X & Y are not NaNs: iXYnotNAN */ + vcmpordpd %ymm8, %ymm8, %ymm12 + vcmpordpd %ymm11, %ymm11, %ymm13 + vcmpeqpd %ymm10, %ymm2, %ymm2 + vcmpeqpd %ymm10, %ymm1, %ymm1 + vandpd %ymm13, %ymm12, %ymm14 + vorpd %ymm1, %ymm2, %ymm2 + vextractf128 $1, %ymm14, %xmm15 + vextractf128 $1, %ymm2, %xmm11 + vshufps $221, %xmm15, %xmm14, %xmm9 + vshufps $221, %xmm11, %xmm2, %xmm12 + +/* + * Path for zero arguments (at least one of both) + * Check if both args are zeros (den. is zero) + */ + vcmpeqpd 32(%rsp), %ymm10, %ymm2 + +/* Check if at least one of X or Y is zero and not NaN: iAXAYZEROnotNAN */ + vpand %xmm9, %xmm12, %xmm1 + +/* Exclude from previous callout mask zero (and not NaN) arguments */ + vpandn %xmm3, %xmm1, %xmm3 + +/* Go to callout */ + vmovmskps %xmm3, %edx + +/* Set sPIO2 to zero if den.
is zero */ + vblendvpd %ymm2, %ymm10, %ymm4, %ymm4 + vorpd %ymm5, %ymm4, %ymm5 + +/* Res = sign(Y)*(X<0)?(PIO2+PI):PIO2 */ + vextractf128 $1, %ymm10, %xmm2 + vextractf128 $1, %ymm8, %xmm3 + vshufps $221, %xmm2, %xmm10, %xmm4 + vshufps $221, %xmm3, %xmm8, %xmm9 + vpcmpgtd %xmm9, %xmm4, %xmm12 + vpshufd $80, %xmm12, %xmm11 + vpshufd $250, %xmm12, %xmm13 + vinsertf128 $1, %xmm13, %ymm11, %ymm14 + vandpd %ymm6, %ymm14, %ymm6 + vaddpd %ymm6, %ymm5, %ymm2 + vorpd %ymm7, %ymm2, %ymm2 + +/* Merge results from main and spec path */ + vpshufd $80, %xmm1, %xmm7 + vpshufd $250, %xmm1, %xmm1 + vinsertf128 $1, %xmm1, %ymm7, %ymm3 + vblendvpd %ymm3, %ymm2, %ymm0, %ymm0 + +/* Return to main vector processing path */ + jmp L(AUX_BRANCH_RETURN) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm8 +END(_ZGVdN4vv_atan2_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_datan2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 dPI[4][2]; + __declspec(align(32)) VUINT32 dPIO2[4][2]; + __declspec(align(32)) VUINT32 dA19[4][2]; + __declspec(align(32)) VUINT32 dA18[4][2]; + __declspec(align(32)) VUINT32 dA17[4][2]; + __declspec(align(32)) VUINT32 dA16[4][2]; + __declspec(align(32)) VUINT32 dA15[4][2]; + __declspec(align(32)) VUINT32 dA14[4][2]; + __declspec(align(32)) VUINT32 dA13[4][2]; + __declspec(align(32)) VUINT32 dA12[4][2]; + __declspec(align(32)) VUINT32 dA11[4][2]; + __declspec(align(32)) VUINT32 dA10[4][2]; + __declspec(align(32)) VUINT32 dA09[4][2]; + __declspec(align(32)) VUINT32 dA08[4][2]; + __declspec(align(32)) VUINT32 dA07[4][2]; + __declspec(align(32)) VUINT32 dA06[4][2]; + __declspec(align(32)) VUINT32 dA05[4][2]; + __declspec(align(32)) VUINT32 dA04[4][2]; + __declspec(align(32)) VUINT32 dA03[4][2]; + __declspec(align(32)) VUINT32 dA02[4][2]; + __declspec(align(32)) VUINT32 dA01[4][2]; + __declspec(align(32)) VUINT32 dA00[4][2]; + __declspec(align(32)) VUINT32 dSIGN_MASK[4][2]; + __declspec(align(32)) VUINT32 
iCHK_WORK_SUB[8][1]; + __declspec(align(32)) VUINT32 iCHK_WORK_CMP[8][1]; + __declspec(align(32)) VUINT32 dABS_MASK[4][2]; + __declspec(align(32)) VUINT32 dZERO[4][2]; +} __svml_datan2_data_internal; +#endif +__svml_datan2_data_internal: + .quad 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18 //dPI + .align 32 + .quad 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18 //dPIO2 + .align 32 + .quad 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3 // dA19 + .align 32 + .quad 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209 // dA18 + .align 32 + .quad 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23 // dA17 + .align 32 + .quad 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3 // dA16 + .align 32 + .quad 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD // dA15 + .align 32 + .quad 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5 // dA14 + .align 32 + .quad 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C // dA13 + .align 32 + .quad 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F // dA12 + .align 32 + .quad 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC // dA11 + .align 32 + .quad 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B // dA10 + .align 32 + .quad 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954 // dA09 + .align 32 + .quad 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF // dA08 + .align 32 + .quad 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0 // dA07 + .align 32 + .quad 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651 // dA06 + .align 32 + .quad 0xBFB745D009BADA94, 0xBFB745D009BADA94, 
0xBFB745D009BADA94, 0xBFB745D009BADA94 // dA05 + .align 32 + .quad 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5 // dA04 + .align 32 + .quad 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7 // dA03 + .align 32 + .quad 0x3FC999999997EE34, 0x3FC999999997EE34, 0x3FC999999997EE34, 0x3FC999999997EE34 // dA02 + .align 32 + .quad 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5 // dA01 + .align 32 + .quad 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000 // dA00 + .align 32 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 //dSIGN_MASK + .align 32 + .long 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000 //iCHK_WORK_SUB + .align 32 + .long 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000 //iCHK_WORK_CMP + .align 32 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff //dABS_MASK + .align 32 + .quad 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000 //dZERO + .align 32 + .type __svml_datan2_data_internal,@object + .size __svml_datan2_data_internal,.-__svml_datan2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core-avx2.S new file mode 100644 index 0000000000..a8d34a6143 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized atan2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN8vv_atan2 _ZGVeN8vv_atan2_avx2_wrapper +#include "../svml_d_atan28_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core.c new file mode 100644 index 0000000000..a0897e9cf0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atan2, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVeN8vv_atan2 +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8vv_atan2, __GI__ZGVeN8vv_atan2, + __redirect__ZGVeN8vv_atan2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core_avx512.S new file mode 100644 index 0000000000..6d18f5f757 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atan28_core_avx512.S @@ -0,0 +1,475 @@ +/* Function atan2 vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. 
+ * + * + */ + +/* Offsets for data table __svml_datan2_data_internal + */ +#define dPI 0 +#define dPIO2 64 +#define dA19 128 +#define dA18 192 +#define dA17 256 +#define dA16 320 +#define dA15 384 +#define dA14 448 +#define dA13 512 +#define dA12 576 +#define dA11 640 +#define dA10 704 +#define dA09 768 +#define dA08 832 +#define dA07 896 +#define dA06 960 +#define dA05 1024 +#define dA04 1088 +#define dA03 1152 +#define dA02 1216 +#define dA01 1280 +#define dA00 1344 +#define dSIGN_MASK 1408 +#define iCHK_WORK_SUB 1472 +#define iCHK_WORK_CMP 1536 +#define dABS_MASK 1600 +#define dZERO 1664 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8vv_atan2_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $256, %rsp + xorl %edx, %edx + +/* + * #define NO_VECTOR_ZERO_ATAN2_ARGS + * Declarations + * Variables + * Constants + * The end of declarations + * Implementation + * Get r0~=1/B + * Cannot be replaced by VQRCP(D, dR0, dB); + * Argument Absolute values + */ + vmovups dABS_MASK+__svml_datan2_data_internal(%rip), %zmm4 + +/* Argument signs */ + vmovups dSIGN_MASK+__svml_datan2_data_internal(%rip), %zmm6 + +/* + * 1) If y<x then a=y, b=x, PIO2=0 + * 2) If y>x then a=-x, b=y, PIO2=Pi/2 + */ + vmovups dPIO2+__svml_datan2_data_internal(%rip), %zmm3 + vandpd %zmm4, %zmm0, %zmm11 + vmovaps %zmm1, %zmm7 + vandpd %zmm4, %zmm7, %zmm2 + vandpd %zmm6, %zmm7, %zmm5 + vandpd %zmm6, %zmm0, %zmm4 + vorpd %zmm6, %zmm2, %zmm12 + vcmppd $17, {sae}, %zmm2, %zmm11, %k1 + vmovdqu iCHK_WORK_CMP+__svml_datan2_data_internal(%rip), %ymm6 + vmovups %zmm11, 64(%rsp) + +/* Check if y and x are on main path.
*/ + vpsrlq $32, %zmm2, %zmm9 + vblendmpd %zmm11, %zmm12, %zmm13{%k1} + vblendmpd %zmm2, %zmm11, %zmm15{%k1} + vpsrlq $32, %zmm11, %zmm8 + vmovdqu iCHK_WORK_SUB+__svml_datan2_data_internal(%rip), %ymm12 + vdivpd {rn-sae}, %zmm15, %zmm13, %zmm1 + vmovups %zmm15, (%rsp) + vpmovqd %zmm9, %ymm14 + vpmovqd %zmm8, %ymm10 + vxorpd %zmm3, %zmm3, %zmm3{%k1} + vpsubd %ymm12, %ymm14, %ymm13 + vpsubd %ymm12, %ymm10, %ymm9 + +/* Polynomial. */ + vmulpd {rn-sae}, %zmm1, %zmm1, %zmm12 + vpcmpgtd %ymm6, %ymm13, %ymm15 + vpcmpeqd %ymm6, %ymm13, %ymm11 + vmulpd {rn-sae}, %zmm12, %zmm12, %zmm13 + vpor %ymm11, %ymm15, %ymm8 + vmovups dA19+__svml_datan2_data_internal(%rip), %zmm11 + vmovups dA15+__svml_datan2_data_internal(%rip), %zmm15 + vpcmpgtd %ymm6, %ymm9, %ymm14 + vpcmpeqd %ymm6, %ymm9, %ymm6 + vpor %ymm6, %ymm14, %ymm10 + vmulpd {rn-sae}, %zmm13, %zmm13, %zmm14 + vmovups dA18+__svml_datan2_data_internal(%rip), %zmm9 + vpor %ymm10, %ymm8, %ymm6 + vmovups dA17+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd231pd {rn-sae}, %zmm14, %zmm11, %zmm15 + vmovups dA14+__svml_datan2_data_internal(%rip), %zmm11 + vmovups dA12+__svml_datan2_data_internal(%rip), %zmm8 + vfmadd231pd {rn-sae}, %zmm14, %zmm9, %zmm11 + vmovups dA13+__svml_datan2_data_internal(%rip), %zmm9 + vfmadd231pd {rn-sae}, %zmm14, %zmm10, %zmm9 + vmovups dA16+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd231pd {rn-sae}, %zmm14, %zmm10, %zmm8 + vmovups dA11+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm15 + vmovups dA10+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm11 + vmovups dA09+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm9 + vmovups dA08+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm8 + vmovups dA07+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm15 + vmovups dA06+__svml_datan2_data_internal(%rip), %zmm10 + 
vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm11 + vmovups dA05+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm9 + vmovups dA04+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm8 + vmovups dA03+__svml_datan2_data_internal(%rip), %zmm10 + +/* A00=1.0, account for it later VQFMA(D, dP4, dP4, dR8, dA00); */ + vmulpd {rn-sae}, %zmm14, %zmm8, %zmm8 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm15 + vmovups dA02+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm11 + vmovups dA01+__svml_datan2_data_internal(%rip), %zmm10 + vfmadd213pd {rn-sae}, %zmm11, %zmm12, %zmm15 + vfmadd213pd {rn-sae}, %zmm10, %zmm14, %zmm9 + vfmadd213pd {rn-sae}, %zmm8, %zmm12, %zmm9 + vmovups __svml_datan2_data_internal(%rip), %zmm8 + vfmadd213pd {rn-sae}, %zmm9, %zmm13, %zmm15 + +/* + * Reconstruction. + * dP=(R+R*dP) + dPIO2 + */ + vfmadd213pd {rn-sae}, %zmm1, %zmm1, %zmm15 + vaddpd {rn-sae}, %zmm3, %zmm15, %zmm1 + vorpd %zmm5, %zmm1, %zmm9 + +/* if x<0, dPI = Pi, else dPI =0 */ + vmovups dZERO+__svml_datan2_data_internal(%rip), %zmm1 + vcmppd $18, {sae}, %zmm1, %zmm7, %k2 + vaddpd {rn-sae}, %zmm8, %zmm9, %zmm9{%k2} + vmovmskps %ymm6, %eax + vorpd %zmm4, %zmm9, %zmm11 + +/* Special branch for fast (vector) processing of zero arguments */ + vmovups 64(%rsp), %zmm9 + testl %eax, %eax + +/* Go to auxiliary branch */ + jne L(AUX_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm6 zmm0 zmm2 zmm3 zmm4 zmm5 zmm7 zmm9 zmm11 + +/* Return from auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH_RETURN): +/* + * Special branch for fast (vector) processing of zero arguments + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm7 zmm11 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm11, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + 
cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm7, 128(%rsp) + vmovups %zmm11, 192(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm11 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 192(%rsp), %zmm11 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) 
(DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm11 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + movsd 128(%rsp,%r14,8), %xmm1 + call atan2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 192(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + cfi_restore(12) + cfi_restore(13) + cfi_restore(14) + # LOE rbx r15 r12d r13d + +/* Auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH): +/* Check if at least one of X or Y is zero: iAXAYZERO */ + vmovups dZERO+__svml_datan2_data_internal(%rip), %zmm8 + +/* Check if both X & Y are not NaNs: iXYnotNAN */ + vcmppd $3, {sae}, %zmm7, %zmm7, %k1 + vcmppd $3, {sae}, %zmm0, %zmm0, %k2 + vcmppd $4, {sae}, %zmm8, %zmm2, %k3 + vcmppd $4, {sae}, %zmm8, %zmm9, %k4 + +/* Res = sign(Y)*(X<0)?(PIO2+PI):PIO2 */ + vpcmpgtq %zmm7, %zmm8, %k6 + vpternlogd $0xff, %zmm1, %zmm1, %zmm10 + vmovaps %zmm10, %zmm15 + vmovaps %zmm10, %zmm12 + vmovaps %zmm10, %zmm13 + vpandnq %zmm2, %zmm2, %zmm15{%k3} + vmovaps %zmm10, %zmm2 + vpandnq %zmm7, %zmm7, %zmm12{%k1} + vpandnq %zmm0, %zmm0, %zmm13{%k2} + vpandnq %zmm9, %zmm9, %zmm2{%k4} + vandpd %zmm13, %zmm12, %zmm14 + vorpd %zmm2, %zmm15, %zmm9 + vpsrlq $32, %zmm14, %zmm1 + vpsrlq $32, %zmm9, %zmm2 + vpmovqd %zmm1, %ymm1 + vpmovqd %zmm2, %ymm9 + +/* Check if at least one of X or Y is zero and not NaN: iAXAYZEROnotNAN */ + vpand %ymm1, %ymm9, %ymm2 + +/* + * Path for zero arguments (at least one of both) + * Check if both args are
zeros (den. is zero) + */ + vmovups (%rsp), %zmm1 + +/* Exclude from previous callout mask zero (and not NaN) arguments */ + vpandn %ymm6, %ymm2, %ymm6 + vcmppd $4, {sae}, %zmm8, %zmm1, %k5 + +/* Go to callout */ + vmovmskps %ymm6, %edx + vpandnq %zmm1, %zmm1, %zmm10{%k5} + +/* Set sPIO2 to zero if den. is zero */ + vpandnq %zmm3, %zmm10, %zmm3 + vpandq %zmm10, %zmm8, %zmm1 + vporq %zmm1, %zmm3, %zmm3 + vorpd %zmm5, %zmm3, %zmm1 + vmovups __svml_datan2_data_internal(%rip), %zmm5 + vaddpd {rn-sae}, %zmm5, %zmm1, %zmm1{%k6} + vorpd %zmm4, %zmm1, %zmm1 + +/* Merge results from main and spec path */ + vpmovzxdq %ymm2, %zmm4 + vpsllq $32, %zmm4, %zmm2 + vpord %zmm2, %zmm4, %zmm3 + vpandnq %zmm11, %zmm3, %zmm11 + vpandq %zmm3, %zmm1, %zmm1 + vporq %zmm1, %zmm11, %zmm11 + +/* Return to main vector processing path */ + jmp L(AUX_BRANCH_RETURN) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm7 zmm11 +END(_ZGVeN8vv_atan2_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_datan2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 dPI[8][2]; + __declspec(align(64)) VUINT32 dPIO2[8][2]; + __declspec(align(64)) VUINT32 dA19[8][2]; + __declspec(align(64)) VUINT32 dA18[8][2]; + __declspec(align(64)) VUINT32 dA17[8][2]; + __declspec(align(64)) VUINT32 dA16[8][2]; + __declspec(align(64)) VUINT32 dA15[8][2]; + __declspec(align(64)) VUINT32 dA14[8][2]; + __declspec(align(64)) VUINT32 dA13[8][2]; + __declspec(align(64)) VUINT32 dA12[8][2]; + __declspec(align(64)) VUINT32 dA11[8][2]; + __declspec(align(64)) VUINT32 dA10[8][2]; + __declspec(align(64)) VUINT32 dA09[8][2]; + __declspec(align(64)) VUINT32 dA08[8][2]; + __declspec(align(64)) VUINT32 dA07[8][2]; + __declspec(align(64)) VUINT32 dA06[8][2]; + __declspec(align(64)) VUINT32 dA05[8][2]; + __declspec(align(64)) VUINT32 dA04[8][2]; + __declspec(align(64)) VUINT32 dA03[8][2]; + __declspec(align(64)) VUINT32 dA02[8][2]; + __declspec(align(64)) VUINT32 dA01[8][2]; + 
__declspec(align(64)) VUINT32 dA00[8][2]; + __declspec(align(64)) VUINT32 dSIGN_MASK[8][2]; + __declspec(align(64)) VUINT32 iCHK_WORK_SUB[16][1]; + __declspec(align(64)) VUINT32 iCHK_WORK_CMP[16][1]; + __declspec(align(64)) VUINT32 dABS_MASK[8][2]; + __declspec(align(64)) VUINT32 dZERO[8][2]; +} __svml_datan2_data_internal; +#endif +__svml_datan2_data_internal: + .quad 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18, 0x400921FB54442D18 //dPI + .align 64 + .quad 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18, 0x3FF921FB54442D18 //dPIO2 + .align 64 + .quad 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3, 0xBEF4FDB537ABC7A3 // dA19 + .align 64 + .quad 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209, 0x3F2CED0A36665209 // dA18 + .align 64 + .quad 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23, 0xBF52E67C93954C23 // dA17 + .align 64 + .quad 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3, 0x3F6F5A1DAE82AFB3 // dA16 + .align 64 + .quad 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD, 0xBF82B2EC618E4BAD // dA15 + .align 64 + .quad 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5, 0x3F914F4C661116A5 // dA14 + .align 64 + .quad 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C, 
0xBF9A5E83B081F69C, 0xBF9A5E83B081F69C // dA13 + .align 64 + .quad 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F, 0x3FA169980CB6AD4F // dA12 + .align 64 + .quad 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC, 0xBFA4EFA2E563C1BC // dA11 + .align 64 + .quad 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B, 0x3FA7EC0FBC50683B // dA10 + .align 64 + .quad 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954, 0xBFAAD261EAA09954 // dA09 + .align 64 + .quad 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF, 0x3FAE1749BD612DCF // dA08 + .align 64 + .quad 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0, 0xBFB11084009435E0 // dA07 + .align 64 + .quad 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651, 0x3FB3B12A49295651 // dA06 + .align 64 + .quad 0xBFB745D009BADA94, 0xBFB745D009BADA94, 0xBFB745D009BADA94, 0xBFB745D009BADA94, 0xBFB745D009BADA94, 0xBFB745D009BADA94, 0xBFB745D009BADA94, 0xBFB745D009BADA94 // dA05 + .align 64 + .quad 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5, 0x3FBC71C707F7D5B5 // dA04 + .align 64 + .quad 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7, 0xBFC2492491EE55C7 // dA03 + .align 64 + .quad 0x3FC999999997EE34, 0x3FC999999997EE34, 0x3FC999999997EE34, 
0x3FC999999997EE34, 0x3FC999999997EE34, 0x3FC999999997EE34, 0x3FC999999997EE34, 0x3FC999999997EE34 // dA02 + .align 64 + .quad 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5, 0xBFD55555555553C5 // dA01 + .align 64 + .quad 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000, 0x3FF0000000000000 // dA00 + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 //dSIGN_MASK + .align 64 + .long 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000, 0x80300000 //iCHK_WORK_SUB + .align 64 + .long 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000, 0xfdd00000 //iCHK_WORK_CMP + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff //dABS_MASK + .align 64 + .quad 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000 //dZERO + .align 64 + .type __svml_datan2_data_internal,@object + .size __svml_datan2_data_internal,.-__svml_datan2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core-avx2.S new file mode 100644 index 0000000000..a2a76e8bfd --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized atan2f. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN16vv_atan2f _ZGVeN16vv_atan2f_avx2_wrapper +#include "../svml_s_atan2f16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core.c new file mode 100644 index 0000000000..6fa806414d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atan2f, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVeN16vv_atan2f +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16vv_atan2f, __GI__ZGVeN16vv_atan2f, + __redirect__ZGVeN16vv_atan2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core_avx512.S new file mode 100644 index 0000000000..f3477cc8e6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f16_core_avx512.S @@ -0,0 +1,399 @@ +/* Function atan2f vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. 
+ * + * + */ + +/* Offsets for data table __svml_satan2_data_internal + */ +#define sZERO 0 +#define sONE 64 +#define sSIGN_MASK 128 +#define sABS_MASK 192 +#define sPIO2 256 +#define sPI 320 +#define sPC8 384 +#define sPC7 448 +#define sPC6 512 +#define sPC5 576 +#define sPC4 640 +#define sPC3 704 +#define sPC2 768 +#define sPC1 832 +#define sPC0 896 +#define iCHK_WORK_SUB 960 +#define iCHK_WORK_CMP 1024 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16vv_atan2f_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $256, %rsp + xorl %edx, %edx + +/* + * #define NO_VECTOR_ZERO_ATAN2_ARGS + * Declarations + * Variables + * Constants + * The end of declarations + * Implementation + * Argument signs + */ + vmovups sABS_MASK+__svml_satan2_data_internal(%rip), %zmm6 + vmovups sONE+__svml_satan2_data_internal(%rip), %zmm3 + +/* Testing on working interval. */ + vmovups iCHK_WORK_SUB+__svml_satan2_data_internal(%rip), %zmm9 + vmovups iCHK_WORK_CMP+__svml_satan2_data_internal(%rip), %zmm14 + +/* + * 1) If y<x then a=y, b=x, PIO2=0 + * 2) If y>x then a=-x, b=y, PIO2=Pi/2 + */ + vmovups sPIO2+__svml_satan2_data_internal(%rip), %zmm4 + vpternlogd $255, %zmm13, %zmm13, %zmm13 + vmovaps %zmm1, %zmm8 + vandps %zmm6, %zmm8, %zmm2 + vandps %zmm6, %zmm0, %zmm1 + vorps sSIGN_MASK+__svml_satan2_data_internal(%rip), %zmm2, %zmm5 + vpsubd %zmm9, %zmm2, %zmm10 + vpsubd %zmm9, %zmm1, %zmm12 + vxorps %zmm2, %zmm8, %zmm7 + vxorps %zmm1, %zmm0, %zmm6 + vcmpps $17, {sae}, %zmm2, %zmm1, %k1 + vpcmpgtd %zmm10, %zmm14, %k2 + vpcmpgtd %zmm12, %zmm14, %k3 + vmovups sPC6+__svml_satan2_data_internal(%rip), %zmm14 + vblendmps %zmm1, %zmm5, %zmm11{%k1} + vblendmps %zmm2, %zmm1, %zmm5{%k1} + vxorps %zmm4, %zmm4, %zmm4{%k1} + +/* + * Division a/b.
+ * Enabled when FMA is available and + * performance is better with NR iteration + */ + vrcp14ps %zmm5, %zmm15 + vfnmadd231ps {rn-sae}, %zmm5, %zmm15, %zmm3 + vfmadd213ps {rn-sae}, %zmm15, %zmm3, %zmm15 + vmulps {rn-sae}, %zmm15, %zmm11, %zmm3 + vfnmadd231ps {rn-sae}, %zmm5, %zmm3, %zmm11 + vfmadd213ps {rn-sae}, %zmm3, %zmm11, %zmm15 + vmovups sPC8+__svml_satan2_data_internal(%rip), %zmm11 + vpternlogd $255, %zmm3, %zmm3, %zmm3 + +/* Polynomial. */ + vmulps {rn-sae}, %zmm15, %zmm15, %zmm9 + vpandnd %zmm10, %zmm10, %zmm13{%k2} + vmulps {rn-sae}, %zmm9, %zmm9, %zmm10 + vfmadd231ps {rn-sae}, %zmm10, %zmm11, %zmm14 + vmovups sPC5+__svml_satan2_data_internal(%rip), %zmm11 + vpandnd %zmm12, %zmm12, %zmm3{%k3} + vpord %zmm3, %zmm13, %zmm3 + vmovups sPC4+__svml_satan2_data_internal(%rip), %zmm13 + vmovups sPC7+__svml_satan2_data_internal(%rip), %zmm12 + vptestmd %zmm3, %zmm3, %k0 + vfmadd213ps {rn-sae}, %zmm13, %zmm10, %zmm14 + vfmadd231ps {rn-sae}, %zmm10, %zmm12, %zmm11 + vmovups sPC3+__svml_satan2_data_internal(%rip), %zmm12 + vmovups sPC2+__svml_satan2_data_internal(%rip), %zmm13 + +/* Special branch for fast (vector) processing of zero arguments */ + kortestw %k0, %k0 + vfmadd213ps {rn-sae}, %zmm12, %zmm10, %zmm11 + vmovups sPC1+__svml_satan2_data_internal(%rip), %zmm12 + vfmadd213ps {rn-sae}, %zmm13, %zmm10, %zmm14 + vmovups sPC0+__svml_satan2_data_internal(%rip), %zmm13 + vfmadd213ps {rn-sae}, %zmm12, %zmm10, %zmm11 + vfmadd213ps {rn-sae}, %zmm13, %zmm10, %zmm14 + vfmadd213ps {rn-sae}, %zmm14, %zmm9, %zmm11 + +/* Reconstruction. 
*/ + vfmadd213ps {rn-sae}, %zmm4, %zmm15, %zmm11 + +/* if x<0, sPI = Pi, else sPI =0 */ + vmovups __svml_satan2_data_internal(%rip), %zmm15 + vorps %zmm7, %zmm11, %zmm9 + vcmpps $18, {sae}, %zmm15, %zmm8, %k4 + vmovups sPI+__svml_satan2_data_internal(%rip), %zmm11 + vaddps {rn-sae}, %zmm11, %zmm9, %zmm9{%k4} + vorps %zmm6, %zmm9, %zmm10 + +/* Go to auxiliary branch */ + jne L(AUX_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 zmm2 zmm3 zmm4 zmm5 zmm6 zmm7 zmm8 zmm10 zmm11 + +/* Return from auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH_RETURN): +/* + * Special branch for fast (vector) processing of zero arguments + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm8 zmm10 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm10, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm8, 128(%rsp) + vmovups %zmm10, 192(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm10 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; 
DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 192(%rsp), %zmm10 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -240; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x10, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -248; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x08, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -256; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x00, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm10 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + movss 128(%rsp,%r14,4), %xmm1 + call atan2f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 192(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + cfi_restore(12) + cfi_restore(13) + cfi_restore(14) + # LOE rbx r15 r12d r13d + +/* Auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH): +/* Check if at least one of X or Y is zero: iAXAYZERO */ + 
vmovups __svml_satan2_data_internal(%rip), %zmm9 + +/* Check if both X & Y are not NaNs: iXYnotNAN */ + vcmpps $3, {sae}, %zmm8, %zmm8, %k1 + vcmpps $3, {sae}, %zmm0, %zmm0, %k2 + vpcmpd $4, %zmm9, %zmm2, %k3 + vpcmpd $4, %zmm9, %zmm1, %k4 + +/* + * Path for zero arguments (at least one of both) + * Check if both args are zeros (den. is zero) + */ + vcmpps $4, {sae}, %zmm9, %zmm5, %k5 + +/* Res = sign(Y)*(X<0)?(PIO2+PI):PIO2 */ + vpcmpgtd %zmm8, %zmm9, %k6 + vpternlogd $255, %zmm14, %zmm14, %zmm14 + vpternlogd $255, %zmm12, %zmm12, %zmm12 + vpternlogd $255, %zmm13, %zmm13, %zmm13 + vpandnd %zmm2, %zmm2, %zmm14{%k3} + vpternlogd $255, %zmm2, %zmm2, %zmm2 + vpandnd %zmm1, %zmm1, %zmm2{%k4} + vpord %zmm2, %zmm14, %zmm15 + vpternlogd $255, %zmm2, %zmm2, %zmm2 + vpandnd %zmm5, %zmm5, %zmm2{%k5} + +/* Set sPIO2 to zero if den. is zero */ + vpandnd %zmm4, %zmm2, %zmm4 + vpandd %zmm2, %zmm9, %zmm5 + vpord %zmm5, %zmm4, %zmm2 + vorps %zmm7, %zmm2, %zmm7 + vaddps {rn-sae}, %zmm11, %zmm7, %zmm7{%k6} + vorps %zmm6, %zmm7, %zmm6 + vpandnd %zmm8, %zmm8, %zmm12{%k1} + vpandnd %zmm0, %zmm0, %zmm13{%k2} + vandps %zmm13, %zmm12, %zmm12 + +/* Check if at least one of X or Y is zero and not NaN: iAXAYZEROnotNAN */ + vpandd %zmm12, %zmm15, %zmm1 + +/* Exclude from previous callout mask zero (and not NaN) arguments */ + vpandnd %zmm3, %zmm1, %zmm3 + +/* Go to callout */ + vptestmd %zmm3, %zmm3, %k0 + kmovw %k0, %edx + +/* Merge results from main and spec path */ + vpandnd %zmm10, %zmm1, %zmm10 + vpandd %zmm1, %zmm6, %zmm11 + vpord %zmm11, %zmm10, %zmm10 + +/* Return to main vector processing path */ + jmp L(AUX_BRANCH_RETURN) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm8 zmm10 +END(_ZGVeN16vv_atan2f_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_satan2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 sZERO[16][1]; + __declspec(align(64)) VUINT32 sONE[16][1]; + __declspec(align(64)) VUINT32 sSIGN_MASK[16][1]; + 
__declspec(align(64)) VUINT32 sABS_MASK[16][1]; + __declspec(align(64)) VUINT32 sPIO2[16][1]; + __declspec(align(64)) VUINT32 sPI[16][1]; + __declspec(align(64)) VUINT32 sPC8[16][1]; + __declspec(align(64)) VUINT32 sPC7[16][1]; + __declspec(align(64)) VUINT32 sPC6[16][1]; + __declspec(align(64)) VUINT32 sPC5[16][1]; + __declspec(align(64)) VUINT32 sPC4[16][1]; + __declspec(align(64)) VUINT32 sPC3[16][1]; + __declspec(align(64)) VUINT32 sPC2[16][1]; + __declspec(align(64)) VUINT32 sPC1[16][1]; + __declspec(align(64)) VUINT32 sPC0[16][1]; + __declspec(align(64)) VUINT32 iCHK_WORK_SUB[16][1]; + __declspec(align(64)) VUINT32 iCHK_WORK_CMP[16][1]; +} __svml_satan2_data_internal; +#endif +__svml_satan2_data_internal: + .long 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000 // sZERO + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 // sONE + .align 64 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 // sSIGN_MASK + .align 64 + .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF // sABS_MASK + .align 64 + .long 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB // sPIO2 + .align 64 + .long 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 
0x40490FDB, 0x40490FDB, 0x40490FDB // sPI + .align 64 + .long 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0 // sA08 + .align 64 + .long 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631 // sA07 + .align 64 + .long 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384 // sA06 + .align 64 + .long 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629 // sA05 + .align 64 + .long 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474 // sA04 + .align 64 + .long 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8 // sA03 + .align 64 + .long 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F // sA02 + .align 64 + .long 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49 // sA01 + .align 64 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 
0x3F800000 // sA00 + .align 64 + .long 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000 //iCHK_WORK_SUB + .align 64 + .long 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000 //iCHK_WORK_CMP + .align 64 + .type __svml_satan2_data_internal,@object + .size __svml_satan2_data_internal,.-__svml_satan2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core-sse2.S new file mode 100644 index 0000000000..d1a67facf1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized atan2f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +#define _ZGVbN4vv_atan2f _ZGVbN4vv_atan2f_sse2 +#include "../svml_s_atan2f4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core.c new file mode 100644 index 0000000000..ee882b0557 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atan2f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVbN4vv_atan2f +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4vv_atan2f, __GI__ZGVbN4vv_atan2f, + __redirect__ZGVbN4vv_atan2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core_sse4.S new file mode 100644 index 0000000000..e4fbe82501 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f4_core_sse4.S @@ -0,0 +1,384 @@ +/* Function atan2f vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. 
+ * + * + */ + +/* Offsets for data table __svml_satan2_data_internal + */ +#define sZERO 0 +#define sSIGN_MASK 16 +#define sABS_MASK 32 +#define sPIO2 48 +#define sPI 64 +#define sPC8 80 +#define sPC7 96 +#define sPC6 112 +#define sPC5 128 +#define sPC4 144 +#define sPC3 160 +#define sPC2 176 +#define sPC1 192 +#define sPC0 208 +#define iCHK_WORK_SUB 224 +#define iCHK_WORK_CMP 240 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4vv_atan2f_sse4) + subq $88, %rsp + cfi_def_cfa_offset(96) + movaps %xmm0, %xmm12 + +/* + * #define NO_VECTOR_ZERO_ATAN2_ARGS + * Declarations + * Variables + * Constants + * The end of declarations + * Implementation + * Arguments signs + */ + movups sABS_MASK+__svml_satan2_data_internal(%rip), %xmm10 + movaps %xmm1, %xmm13 + movaps %xmm10, %xmm11 + andps %xmm12, %xmm10 + andps %xmm13, %xmm11 + movaps %xmm10, %xmm7 + cmpltps %xmm11, %xmm7 + +/* + * 1) If y<x then a= y, b=x, PIO2=0 + * 2) If y>x then a=-x, b=y, PIO2=Pi/2 + */ + movups sSIGN_MASK+__svml_satan2_data_internal(%rip), %xmm6 + movaps %xmm7, %xmm0 + orps %xmm11, %xmm6 + movaps %xmm10, %xmm4 + andnps %xmm6, %xmm0 + movaps %xmm7, %xmm6 + movaps %xmm11, %xmm5 + andps %xmm7, %xmm4 + andnps %xmm10, %xmm6 + andps %xmm7, %xmm5 + orps %xmm4, %xmm0 + orps %xmm5, %xmm6 + +/* Division a/b. */ + divps %xmm6, %xmm0 + +/* Testing on working interval. */ + movdqu iCHK_WORK_SUB+__svml_satan2_data_internal(%rip), %xmm14 + movaps %xmm11, %xmm15 + movaps %xmm10, %xmm3 + psubd %xmm14, %xmm15 + psubd %xmm14, %xmm3 + movdqa %xmm15, %xmm1 + movdqu iCHK_WORK_CMP+__svml_satan2_data_internal(%rip), %xmm2 + movdqa %xmm3, %xmm14 + pcmpgtd %xmm2, %xmm1 + pcmpeqd %xmm2, %xmm15 + pcmpgtd %xmm2, %xmm14 + pcmpeqd %xmm2, %xmm3 + por %xmm15, %xmm1 + por %xmm3, %xmm14 + por %xmm14, %xmm1 + +/* Polynomial. 
*/ + movaps %xmm0, %xmm14 + mulps %xmm0, %xmm14 + movaps %xmm13, %xmm4 + movmskps %xmm1, %ecx + movaps %xmm14, %xmm15 + movaps %xmm11, %xmm9 + mulps %xmm14, %xmm15 + pxor %xmm13, %xmm9 + movups sPC8+__svml_satan2_data_internal(%rip), %xmm2 + movaps %xmm10, %xmm8 + mulps %xmm15, %xmm2 + pxor %xmm12, %xmm8 + movups sPC7+__svml_satan2_data_internal(%rip), %xmm3 + xorl %edx, %edx + mulps %xmm15, %xmm3 + addps sPC6+__svml_satan2_data_internal(%rip), %xmm2 + mulps %xmm15, %xmm2 + addps sPC5+__svml_satan2_data_internal(%rip), %xmm3 + mulps %xmm15, %xmm3 + addps sPC4+__svml_satan2_data_internal(%rip), %xmm2 + mulps %xmm15, %xmm2 + addps sPC3+__svml_satan2_data_internal(%rip), %xmm3 + mulps %xmm15, %xmm3 + addps sPC2+__svml_satan2_data_internal(%rip), %xmm2 + mulps %xmm2, %xmm15 + addps sPC1+__svml_satan2_data_internal(%rip), %xmm3 + mulps %xmm3, %xmm14 + addps sPC0+__svml_satan2_data_internal(%rip), %xmm15 + +/* if x<0, sPI = Pi, else sPI =0 */ + movups __svml_satan2_data_internal(%rip), %xmm5 + xorl %eax, %eax + andnps sPIO2+__svml_satan2_data_internal(%rip), %xmm7 + addps %xmm14, %xmm15 + cmpleps %xmm5, %xmm4 + +/* Reconstruction. 
*/ + mulps %xmm15, %xmm0 + andps sPI+__svml_satan2_data_internal(%rip), %xmm4 + addps %xmm7, %xmm0 + orps %xmm9, %xmm0 + addps %xmm4, %xmm0 + orps %xmm8, %xmm0 + +/* Special branch for fast (vector) processing of zero arguments */ + testl %ecx, %ecx + +/* Go to auxiliary branch */ + jne L(AUX_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 xmm1 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 + +/* Return from auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH_RETURN): +/* + * Special branch for fast (vector) processing of zero arguments + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 xmm12 xmm13 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $88, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(96) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm12, 32(%rsp) + movups %xmm13, 48(%rsp) + movups %xmm0, 64(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 + + movq %r12, 16(%rsp) + cfi_offset(12, -80) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -88) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -96) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 64(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -80) + cfi_offset(13, -88) + cfi_offset(14, -96) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input 
+ */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + movss 48(%rsp,%r14,4), %xmm1 + call atan2f@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + cfi_restore(12) + cfi_restore(13) + cfi_restore(14) + # LOE rbx rbp r15 r12d r13d + +/* Auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH): +/* Check if both X & Y are not NaNs: iXYnotNAN */ + movaps %xmm13, %xmm3 + movaps %xmm12, %xmm2 + cmpordps %xmm13, %xmm3 + cmpordps %xmm12, %xmm2 + +/* + * Path for zero arguments (at least one of both) + * Check if both args are zeros (den. is zero) + */ + cmpeqps %xmm5, %xmm6 + +/* Check if at least one of X or Y is zero: iAXAYZERO */ + pcmpeqd %xmm5, %xmm11 + pcmpeqd %xmm5, %xmm10 + andps %xmm2, %xmm3 + por %xmm10, %xmm11 + +/* Check if at least one of X or Y is zero and not NaN: iAXAYZEROnotNAN */ + andps %xmm3, %xmm11 + +/* Exclude from previous callout mask zero (and not NaN) arguments */ + movaps %xmm11, %xmm10 + pandn %xmm1, %xmm10 + +/* Set sPIO2 to zero if den. 
is zero */ + movaps %xmm6, %xmm1 + andnps %xmm7, %xmm1 + andps %xmm5, %xmm6 + orps %xmm6, %xmm1 + +/* Res = sign(Y)*(X<0)?(PIO2+PI):PIO2 */ + pcmpgtd %xmm13, %xmm5 + orps %xmm9, %xmm1 + andps %xmm4, %xmm5 + +/* Merge results from main and spec path */ + movaps %xmm11, %xmm4 + addps %xmm5, %xmm1 + +/* Go to callout */ + movmskps %xmm10, %edx + orps %xmm8, %xmm1 + andnps %xmm0, %xmm4 + andps %xmm11, %xmm1 + movaps %xmm4, %xmm0 + orps %xmm1, %xmm0 + +/* Return to main vector processing path */ + jmp L(AUX_BRANCH_RETURN) + # LOE rbx rbp r12 r13 r14 r15 eax edx xmm0 xmm12 xmm13 +END(_ZGVbN4vv_atan2f_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_satan2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 sZERO[4][1]; + __declspec(align(16)) VUINT32 sSIGN_MASK[4][1]; + __declspec(align(16)) VUINT32 sABS_MASK[4][1]; + __declspec(align(16)) VUINT32 sPIO2[4][1]; + __declspec(align(16)) VUINT32 sPI[4][1]; + __declspec(align(16)) VUINT32 sPC8[4][1]; + __declspec(align(16)) VUINT32 sPC7[4][1]; + __declspec(align(16)) VUINT32 sPC6[4][1]; + __declspec(align(16)) VUINT32 sPC5[4][1]; + __declspec(align(16)) VUINT32 sPC4[4][1]; + __declspec(align(16)) VUINT32 sPC3[4][1]; + __declspec(align(16)) VUINT32 sPC2[4][1]; + __declspec(align(16)) VUINT32 sPC1[4][1]; + __declspec(align(16)) VUINT32 sPC0[4][1]; + __declspec(align(16)) VUINT32 iCHK_WORK_SUB[4][1]; + __declspec(align(16)) VUINT32 iCHK_WORK_CMP[4][1]; +} __svml_satan2_data_internal; +#endif +__svml_satan2_data_internal: + .long 0x00000000, 0x00000000, 0x00000000, 0x00000000 // sZERO + .align 16 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 // sSIGN_MASK + .align 16 + .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF // sABS_MASK + .align 16 + .long 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB // sPIO2 + .align 16 + .long 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB // sPI + .align 16 + .long 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0 // sA08 + 
.align 16 + .long 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631 // sA07 + .align 16 + .long 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384 // sA06 + .align 16 + .long 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629 // sA05 + .align 16 + .long 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474 // sA04 + .align 16 + .long 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8 // sA03 + .align 16 + .long 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F // sA02 + .align 16 + .long 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49 // sA01 + .align 16 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 // sA00 + .align 16 + .long 0x81000000, 0x81000000, 0x81000000, 0x81000000 //iCHK_WORK_SUB + .align 16 + .long 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000 //iCHK_WORK_CMP + .align 16 + .type __svml_satan2_data_internal,@object + .size __svml_satan2_data_internal,.-__svml_satan2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core-sse.S new file mode 100644 index 0000000000..21b1d3ff63 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized atan2f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +#define _ZGVdN8vv_atan2f _ZGVdN8vv_atan2f_sse_wrapper +#include "../svml_s_atan2f8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core.c new file mode 100644 index 0000000000..7e02050983 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atan2f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVdN8vv_atan2f +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8vv_atan2f, __GI__ZGVdN8vv_atan2f, + __redirect__ZGVdN8vv_atan2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core_avx2.S new file mode 100644 index 0000000000..2e6e5eb71c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atan2f8_core_avx2.S @@ -0,0 +1,362 @@ +/* Function atan2f vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * For 0.0 <= x <= 7.0/16.0: atan(x) = atan(0.0) + atan(s), where s=(x-0.0)/(1.0+0.0*x) + * For 7.0/16.0 <= x <= 11.0/16.0: atan(x) = atan(0.5) + atan(s), where s=(x-0.5)/(1.0+0.5*x) + * For 11.0/16.0 <= x <= 19.0/16.0: atan(x) = atan(1.0) + atan(s), where s=(x-1.0)/(1.0+1.0*x) + * For 19.0/16.0 <= x <= 39.0/16.0: atan(x) = atan(1.5) + atan(s), where s=(x-1.5)/(1.0+1.5*x) + * For 39.0/16.0 <= x <= inf : atan(x) = atan(inf) + atan(s), where s=-1.0/x + * Where atan(s) ~= s+s^3*Poly11(s^2) on interval |s|<7.0/0.16. 
+ * + * + */ + +/* Offsets for data table __svml_satan2_data_internal + */ +#define sZERO 0 +#define sSIGN_MASK 32 +#define sABS_MASK 64 +#define sPIO2 96 +#define sPI 128 +#define sPC8 160 +#define sPC7 192 +#define sPC6 224 +#define sPC5 256 +#define sPC4 288 +#define sPC3 320 +#define sPC2 352 +#define sPC1 384 +#define sPC0 416 +#define iCHK_WORK_SUB 448 +#define iCHK_WORK_CMP 480 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8vv_atan2f_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $128, %rsp + xorl %edx, %edx + +/* + * #define NO_VECTOR_ZERO_ATAN2_ARGS + * Declarations + * Variables + * Constants + * The end of declarations + * Implementation + * Arguments signs + */ + vmovups sABS_MASK+__svml_satan2_data_internal(%rip), %ymm2 + +/* Testing on working interval. */ + vmovups iCHK_WORK_SUB+__svml_satan2_data_internal(%rip), %ymm15 + vmovups iCHK_WORK_CMP+__svml_satan2_data_internal(%rip), %ymm9 + +/* if x<0, sPI = Pi, else sPI =0 */ + vmovups __svml_satan2_data_internal(%rip), %ymm5 + vmovaps %ymm1, %ymm7 + vandps %ymm2, %ymm7, %ymm13 + vandps %ymm2, %ymm0, %ymm12 + vcmplt_oqps %ymm13, %ymm12, %ymm4 + vcmple_oqps %ymm5, %ymm7, %ymm6 + vpsubd %ymm15, %ymm13, %ymm10 + vpsubd %ymm15, %ymm12, %ymm8 + +/* + * 1) If y<x then a= y, b=x, PIO2=0 + * 2) If y>x then a=-x, b=y, PIO2=Pi/2 + */ + vorps sSIGN_MASK+__svml_satan2_data_internal(%rip), %ymm13, %ymm3 + vblendvps %ymm4, %ymm12, %ymm3, %ymm14 + vblendvps %ymm4, %ymm13, %ymm12, %ymm3 + +/* Division a/b. */ + vdivps %ymm3, %ymm14, %ymm11 + vpcmpgtd %ymm9, %ymm10, %ymm14 + vpcmpeqd %ymm9, %ymm10, %ymm15 + vpor %ymm15, %ymm14, %ymm10 + vmovups sPC7+__svml_satan2_data_internal(%rip), %ymm15 + vpcmpgtd %ymm9, %ymm8, %ymm14 + vpcmpeqd %ymm9, %ymm8, %ymm8 + vpor %ymm8, %ymm14, %ymm9 + vmovups sPC8+__svml_satan2_data_internal(%rip), %ymm14 + vpor %ymm9, %ymm10, %ymm10 + +/* Polynomial. 
*/ + vmulps %ymm11, %ymm11, %ymm9 + vmulps %ymm9, %ymm9, %ymm8 + vfmadd213ps sPC6+__svml_satan2_data_internal(%rip), %ymm8, %ymm14 + vfmadd213ps sPC5+__svml_satan2_data_internal(%rip), %ymm8, %ymm15 + vfmadd213ps sPC4+__svml_satan2_data_internal(%rip), %ymm8, %ymm14 + vfmadd213ps sPC3+__svml_satan2_data_internal(%rip), %ymm8, %ymm15 + vfmadd213ps sPC2+__svml_satan2_data_internal(%rip), %ymm8, %ymm14 + vfmadd213ps sPC1+__svml_satan2_data_internal(%rip), %ymm8, %ymm15 + vfmadd213ps sPC0+__svml_satan2_data_internal(%rip), %ymm8, %ymm14 + vfmadd213ps %ymm14, %ymm9, %ymm15 + vandnps sPIO2+__svml_satan2_data_internal(%rip), %ymm4, %ymm4 + +/* Reconstruction. */ + vfmadd213ps %ymm4, %ymm11, %ymm15 + vxorps %ymm13, %ymm7, %ymm1 + vandps sPI+__svml_satan2_data_internal(%rip), %ymm6, %ymm6 + vorps %ymm1, %ymm15, %ymm11 + vaddps %ymm11, %ymm6, %ymm8 + vmovmskps %ymm10, %eax + vxorps %ymm12, %ymm0, %ymm2 + vorps %ymm2, %ymm8, %ymm9 + +/* Special branch for fast (vector) processing of zero arguments */ + testl %eax, %eax + +/* Go to auxiliary branch */ + jne L(AUX_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 ymm2 ymm3 ymm4 ymm5 ymm6 ymm7 ymm9 ymm10 ymm12 ymm13 + +/* Return from auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH_RETURN): +/* + * Special branch for fast (vector) processing of zero arguments + * The end of implementation + */ + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm7 ymm9 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %ymm9, %ymm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm0, 32(%rsp) + vmovups %ymm7, 64(%rsp) + vmovups %ymm9, 96(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm9 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 
16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 96(%rsp), %ymm9 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -112; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x90, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 
0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm9 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + movss 64(%rsp,%r14,4), %xmm1 + call atan2f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 96(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + cfi_restore(12) + cfi_restore(13) + cfi_restore(14) + # LOE rbx r15 r12d r13d + +/* Auxiliary branch + * for out of main path inputs + */ + +L(AUX_BRANCH): +/* Check if at least one of X or Y is zero: iAXAYZERO */ + vpcmpeqd %ymm5, %ymm13, %ymm13 + vpcmpeqd %ymm5, %ymm12, %ymm12 + +/* Check if both X & Y are not NaNs: iXYnotNAN */ + vcmpordps %ymm7, %ymm7, %ymm11 + vcmpordps %ymm0, %ymm0, %ymm14 + +/* + * Path for zero arguments (at least one of both) + * Check if both args are zeros (den. is zero) + */ + vcmpeqps %ymm5, %ymm3, %ymm3 + vpor %ymm12, %ymm13, %ymm15 + +/* Set sPIO2 to zero if den. 
is zero */ + vblendvps %ymm3, %ymm5, %ymm4, %ymm4 + vandps %ymm14, %ymm11, %ymm8 + +/* Check if at least one of X or Y is zero and not NaN: iAXAYZEROnotNAN */ + vpand %ymm8, %ymm15, %ymm8 + +/* Res = sign(Y)*(X<0)?(PIO2+PI):PIO2 */ + vpcmpgtd %ymm7, %ymm5, %ymm5 + vorps %ymm1, %ymm4, %ymm1 + vandps %ymm6, %ymm5, %ymm6 + vaddps %ymm6, %ymm1, %ymm1 + +/* Exclude from previous callout mask zero (and not NaN) arguments */ + vpandn %ymm10, %ymm8, %ymm10 + vorps %ymm2, %ymm1, %ymm2 + +/* Go to callout */ + vmovmskps %ymm10, %edx + +/* Merge results from main and spec path */ + vblendvps %ymm8, %ymm2, %ymm9, %ymm9 + +/* Return to main vector processing path */ + jmp L(AUX_BRANCH_RETURN) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm7 ymm9 +END(_ZGVdN8vv_atan2f_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_satan2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 sZERO[8][1]; + __declspec(align(32)) VUINT32 sSIGN_MASK[8][1]; + __declspec(align(32)) VUINT32 sABS_MASK[8][1]; + __declspec(align(32)) VUINT32 sPIO2[8][1]; + __declspec(align(32)) VUINT32 sPI[8][1]; + __declspec(align(32)) VUINT32 sPC8[8][1]; + __declspec(align(32)) VUINT32 sPC7[8][1]; + __declspec(align(32)) VUINT32 sPC6[8][1]; + __declspec(align(32)) VUINT32 sPC5[8][1]; + __declspec(align(32)) VUINT32 sPC4[8][1]; + __declspec(align(32)) VUINT32 sPC3[8][1]; + __declspec(align(32)) VUINT32 sPC2[8][1]; + __declspec(align(32)) VUINT32 sPC1[8][1]; + __declspec(align(32)) VUINT32 sPC0[8][1]; + __declspec(align(32)) VUINT32 iCHK_WORK_SUB[8][1]; + __declspec(align(32)) VUINT32 iCHK_WORK_CMP[8][1]; +} __svml_satan2_data_internal; +#endif +__svml_satan2_data_internal: + .long 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000 // sZERO + .align 32 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 // sSIGN_MASK + .align 32 + .long 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 
0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF // sABS_MASK + .align 32 + .long 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB // sPIO2 + .align 32 + .long 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB // sPI + .align 32 + .long 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0, 0x3B322CC0 // sA08 + .align 32 + .long 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631, 0xBC7F2631 // sA07 + .align 32 + .long 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384, 0x3D2BC384 // sA06 + .align 32 + .long 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629, 0xBD987629 // sA05 + .align 32 + .long 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474, 0x3DD96474 // sA04 + .align 32 + .long 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8, 0xBE1161F8 // sA03 + .align 32 + .long 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F, 0x3E4CB79F // sA02 + .align 32 + .long 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49, 0xBEAAAA49 // sA01 + .align 32 + .long 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 // sA00 + .align 32 + .long 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000, 0x81000000 //iCHK_WORK_SUB + .align 32 + .long 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000, 0xFC000000 //iCHK_WORK_CMP + .align 32 + .type __svml_satan2_data_internal,@object + .size __svml_satan2_data_internal,.-__svml_satan2_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_atan22_core.S b/sysdeps/x86_64/fpu/svml_d_atan22_core.S new file mode 100644 index 0000000000..f3089e70f9 --- /dev/null +++ 
b/sysdeps/x86_64/fpu/svml_d_atan22_core.S @@ -0,0 +1,29 @@ +/* Function atan2 vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2vv_atan2) +WRAPPER_IMPL_SSE2_ff atan2 +END (_ZGVbN2vv_atan2) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2vv_atan2) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_atan24_core.S b/sysdeps/x86_64/fpu/svml_d_atan24_core.S new file mode 100644 index 0000000000..8a163d12d2 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atan24_core.S @@ -0,0 +1,29 @@ +/* Function atan2 vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4vv_atan2) +WRAPPER_IMPL_AVX_ff _ZGVbN2vv_atan2 +END (_ZGVdN4vv_atan2) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4vv_atan2) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_atan24_core_avx.S b/sysdeps/x86_64/fpu/svml_d_atan24_core_avx.S new file mode 100644 index 0000000000..0ee5ae8faf --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atan24_core_avx.S @@ -0,0 +1,25 @@ +/* Function atan2 vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4vv_atan2) +WRAPPER_IMPL_AVX_ff _ZGVbN2vv_atan2 +END (_ZGVcN4vv_atan2) diff --git a/sysdeps/x86_64/fpu/svml_d_atan28_core.S b/sysdeps/x86_64/fpu/svml_d_atan28_core.S new file mode 100644 index 0000000000..b85f696686 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atan28_core.S @@ -0,0 +1,25 @@ +/* Function atan2 vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8vv_atan2) +WRAPPER_IMPL_AVX512_ff _ZGVdN4vv_atan2 +END (_ZGVeN8vv_atan2) diff --git a/sysdeps/x86_64/fpu/svml_s_atan2f16_core.S b/sysdeps/x86_64/fpu/svml_s_atan2f16_core.S new file mode 100644 index 0000000000..25acb31dfb --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atan2f16_core.S @@ -0,0 +1,25 @@ +/* Function atan2f vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16vv_atan2f) +WRAPPER_IMPL_AVX512_ff _ZGVdN8vv_atan2f +END (_ZGVeN16vv_atan2f) diff --git a/sysdeps/x86_64/fpu/svml_s_atan2f4_core.S b/sysdeps/x86_64/fpu/svml_s_atan2f4_core.S new file mode 100644 index 0000000000..bc99f0ba10 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atan2f4_core.S @@ -0,0 +1,29 @@ +/* Function atan2f vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4vv_atan2f) +WRAPPER_IMPL_SSE2_ff atan2f +END (_ZGVbN4vv_atan2f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4vv_atan2f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_atan2f8_core.S b/sysdeps/x86_64/fpu/svml_s_atan2f8_core.S new file mode 100644 index 0000000000..bfcdb3c372 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atan2f8_core.S @@ -0,0 +1,29 @@ +/* Function atan2f vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8vv_atan2f) +WRAPPER_IMPL_AVX_ff _ZGVbN4vv_atan2f +END (_ZGVdN8vv_atan2f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8vv_atan2f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_atan2f8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_atan2f8_core_avx.S new file mode 100644 index 0000000000..1aa8d05822 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atan2f8_core_avx.S @@ -0,0 +1,25 @@ +/* Function atan2f vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY(_ZGVcN8vv_atan2f) +WRAPPER_IMPL_AVX_ff _ZGVbN4vv_atan2f +END(_ZGVcN8vv_atan2f) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx.c new file mode 100644 index 0000000000..e423bce25b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-atan2.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx2.c new file mode 100644 index 0000000000..e423bce25b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-atan2.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx512f.c new file mode 100644 index 0000000000..e423bce25b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan2-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-atan2.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atan2.c b/sysdeps/x86_64/fpu/test-double-libmvec-atan2.c new file mode 100644 index 0000000000..d0aa626d95 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atan2.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC atan2 +#include "test-vector-abi-arg2.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index b1981ac7e4..37a7a1c777 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVbN2v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVbN2v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVbN2v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVbN2v_cbrt) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVbN2vv_atan2) #define VEC_INT_TYPE __m128i diff --git 
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 47915a7e59..4313f67e06 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVdN4v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVdN4v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVdN4v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVdN4v_cbrt) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVdN4vv_atan2) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 5cd5049807..4b8b00f16d 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVcN4v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVcN4v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVcN4v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVcN4v_cbrt) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVcN4vv_atan2) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 83970739ab..d06522a407 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cosh), _ZGVeN8v_cosh) VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVeN8v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVeN8v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVeN8v_cbrt) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVeN8vv_atan2) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx.c new file mode 100644 index 0000000000..5c7e2c9ad5 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx.c 
@@ -0,0 +1 @@ +#include "test-float-libmvec-atan2f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx2.c new file mode 100644 index 0000000000..5c7e2c9ad5 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atan2f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx512f.c new file mode 100644 index 0000000000..5c7e2c9ad5 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atan2f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atan2f.c b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f.c new file mode 100644 index 0000000000..beb5c745cb --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atan2f.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC atan2f +#include "test-vector-abi-arg2.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 0420f11c28..0bd631bf9a 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVeN16v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVeN16v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVeN16v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVeN16v_cbrtf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVeN16vv_atan2f) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index c8f7580265..1018398bd3 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVbN4v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVbN4v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), 
_ZGVbN4v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVbN4v_cbrtf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVbN4vv_atan2f) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index b581796b88..42ea28f30f 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVdN8v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVdN8v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVdN8v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVdN8v_cbrtf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVdN8vv_atan2f) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index f16789e5ff..70a0216a07 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -37,6 +37,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (coshf), _ZGVcN8v_coshf) VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVcN8v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVcN8v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVcN8v_cbrtf) +VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVcN8vv_atan2f) #define VEC_INT_TYPE __m128i From patchwork Tue Dec 28 20:11:23 2021 X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573826 To: libc-alpha@sourceware.org Subject: [PATCH v4 11/18] x86-64: Add vector log10/log10f implementation to libmvec Date: Tue, 28 Dec 2021 12:11:23 -0800 Message-Id: <20211228201130.737370-12-skpgkp2@gmail.com> In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> From: Sunil Pandey Reply-To: Sunil K Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com Implement vectorized log10/log10f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector log10/log10f with regenerated ulps. 
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_log102_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_log102_core.c | 27 + .../fpu/multiarch/svml_d_log102_core_sse4.S | 1086 +++++++++++++++++ .../fpu/multiarch/svml_d_log104_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_log104_core.c | 27 + .../fpu/multiarch/svml_d_log104_core_avx2.S | 1071 ++++++++++++++++ .../fpu/multiarch/svml_d_log108_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_log108_core.c | 27 + .../fpu/multiarch/svml_d_log108_core_avx512.S | 299 +++++ .../fpu/multiarch/svml_s_log10f16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_log10f16_core.c | 28 + .../multiarch/svml_s_log10f16_core_avx512.S | 238 ++++ .../fpu/multiarch/svml_s_log10f4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_log10f4_core.c | 28 + .../fpu/multiarch/svml_s_log10f4_core_sse4.S | 243 ++++ .../fpu/multiarch/svml_s_log10f8_core-sse.S | 20 + .../fpu/multiarch/svml_s_log10f8_core.c | 28 + .../fpu/multiarch/svml_s_log10f8_core_avx2.S | 243 ++++ sysdeps/x86_64/fpu/svml_d_log102_core.S | 29 + sysdeps/x86_64/fpu/svml_d_log104_core.S | 29 + sysdeps/x86_64/fpu/svml_d_log104_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_log108_core.S | 25 + sysdeps/x86_64/fpu/svml_s_log10f16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_log10f4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_log10f8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_log10f8_core_avx.S | 25 + .../fpu/test-double-libmvec-log10-avx.c | 1 + .../fpu/test-double-libmvec-log10-avx2.c | 1 + .../fpu/test-double-libmvec-log10-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-log10.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-log10f-avx.c | 1 + .../fpu/test-float-libmvec-log10f-avx2.c | 1 + .../fpu/test-float-libmvec-log10f-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-log10f.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 3752 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log102_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log102_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log102_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log104_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log104_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log104_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log108_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log108_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log108_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log102_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log104_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_log104_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log108_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log10f16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log10f4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log10f8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log10f8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log10-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log10-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log10-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log10.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log10f.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 31878bf4ed..4ad584c227 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -219,4 +219,15 @@ #define __DECL_SIMD_atan2f32x #define __DECL_SIMD_atan2f64x #define __DECL_SIMD_atan2f128x + +#define __DECL_SIMD_log10 +#define __DECL_SIMD_log10f +#define __DECL_SIMD_log10l +#define __DECL_SIMD_log10f16 +#define __DECL_SIMD_log10f32 +#define __DECL_SIMD_log10f64 +#define __DECL_SIMD_log10f128 +#define __DECL_SIMD_log10f32x +#define __DECL_SIMD_log10f64x +#define __DECL_SIMD_log10f128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 1bd4911993..f21384758a 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -104,7 +104,7 @@ __MATHCALL (ldexp,, (_Mdouble_ __x, int __exponent)); __MATHCALL_VEC (log,, (_Mdouble_ __x)); /* Base-ten logarithm of X. */ -__MATHCALL (log10,, (_Mdouble_ __x)); +__MATHCALL_VEC (log10,, (_Mdouble_ __x)); /* Break VALUE into integral and fractional parts. 
*/ __MATHCALL (modf,, (_Mdouble_ __x, _Mdouble_ *__iptr)) __nonnull ((2)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 2b3b8d3886..8108a2a189 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -54,6 +54,7 @@ GLIBC_2.35 _ZGVbN2v_cosh F GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2v_expm1 F +GLIBC_2.35 _ZGVbN2v_log10 F GLIBC_2.35 _ZGVbN2v_sinh F GLIBC_2.35 _ZGVbN2vv_atan2 F GLIBC_2.35 _ZGVbN2vv_hypot F @@ -65,6 +66,7 @@ GLIBC_2.35 _ZGVbN4v_coshf F GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4v_expm1f F +GLIBC_2.35 _ZGVbN4v_log10f F GLIBC_2.35 _ZGVbN4v_sinhf F GLIBC_2.35 _ZGVbN4vv_atan2f F GLIBC_2.35 _ZGVbN4vv_hypotf F @@ -76,6 +78,7 @@ GLIBC_2.35 _ZGVcN4v_cosh F GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4v_expm1 F +GLIBC_2.35 _ZGVcN4v_log10 F GLIBC_2.35 _ZGVcN4v_sinh F GLIBC_2.35 _ZGVcN4vv_atan2 F GLIBC_2.35 _ZGVcN4vv_hypot F @@ -87,6 +90,7 @@ GLIBC_2.35 _ZGVcN8v_coshf F GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8v_expm1f F +GLIBC_2.35 _ZGVcN8v_log10f F GLIBC_2.35 _ZGVcN8v_sinhf F GLIBC_2.35 _ZGVcN8vv_atan2f F GLIBC_2.35 _ZGVcN8vv_hypotf F @@ -98,6 +102,7 @@ GLIBC_2.35 _ZGVdN4v_cosh F GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4v_expm1 F +GLIBC_2.35 _ZGVdN4v_log10 F GLIBC_2.35 _ZGVdN4v_sinh F GLIBC_2.35 _ZGVdN4vv_atan2 F GLIBC_2.35 _ZGVdN4vv_hypot F @@ -109,6 +114,7 @@ GLIBC_2.35 _ZGVdN8v_coshf F GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8v_expm1f F +GLIBC_2.35 _ZGVdN8v_log10f F GLIBC_2.35 _ZGVdN8v_sinhf F GLIBC_2.35 _ZGVdN8vv_atan2f F GLIBC_2.35 _ZGVdN8vv_hypotf F @@ -120,6 +126,7 @@ GLIBC_2.35 _ZGVeN16v_coshf F GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16v_expm1f F +GLIBC_2.35 _ZGVeN16v_log10f F GLIBC_2.35 
_ZGVeN16v_sinhf F GLIBC_2.35 _ZGVeN16vv_atan2f F GLIBC_2.35 _ZGVeN16vv_hypotf F @@ -131,6 +138,7 @@ GLIBC_2.35 _ZGVeN8v_cosh F GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8v_expm1 F +GLIBC_2.35 _ZGVeN8v_log10 F GLIBC_2.35 _ZGVeN8v_sinh F GLIBC_2.35 _ZGVeN8vv_atan2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 62f2890ab3..64e80ada7a 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -102,6 +102,10 @@ # define __DECL_SIMD_atan2 __DECL_SIMD_x86_64 # undef __DECL_SIMD_atan2f # define __DECL_SIMD_atan2f __DECL_SIMD_x86_64 +# undef __DECL_SIMD_log10 +# define __DECL_SIMD_log10 __DECL_SIMD_x86_64 +# undef __DECL_SIMD_log10f +# define __DECL_SIMD_log10f __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 2269b74d50..f5050c68af 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -50,6 +50,8 @@ !GCC$ builtin (cbrtf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (atan2) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (atan2f) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (log10) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (log10f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -85,3 +87,5 @@ !GCC$ builtin (cbrtf) attributes simd (notinbranch) if('x32') !GCC$ builtin (atan2) attributes simd (notinbranch) if('x32') !GCC$ builtin (atan2f) attributes simd (notinbranch) if('x32') +!GCC$ builtin (log10) attributes simd (notinbranch) if('x32') +!GCC$ builtin (log10f) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 96a40856fa..ba37044e9d 100644 --- 
a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -35,6 +35,7 @@ libmvec-funcs = \ expm1 \ hypot \ log \ + log10 \ pow \ sin \ sincos \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index f58c98eb45..8beaf0736f 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -22,6 +22,7 @@ libmvec { _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; + _ZGVbN2v_log10; _ZGVcN4v_log10; _ZGVdN4v_log10; _ZGVeN8v_log10; _ZGVbN2v_sinh; _ZGVcN4v_sinh; _ZGVdN4v_sinh; _ZGVeN8v_sinh; _ZGVbN2vv_atan2; _ZGVcN4vv_atan2; _ZGVdN4vv_atan2; _ZGVeN8vv_atan2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; @@ -33,6 +34,7 @@ libmvec { _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; + _ZGVbN4v_log10f; _ZGVcN8v_log10f; _ZGVdN8v_log10f; _ZGVeN16v_log10f; _ZGVbN4v_sinhf; _ZGVcN8v_sinhf; _ZGVdN8v_sinhf; _ZGVeN16v_sinhf; _ZGVbN4vv_atan2f; _ZGVcN8vv_atan2f; _ZGVdN8vv_atan2f; _ZGVeN16vv_atan2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 6f59c61756..b0cd9d60ea 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1641,6 +1641,26 @@ float: 2 float128: 1 ldouble: 1 +Function: "log10_vlen16": +float: 1 + +Function: "log10_vlen2": +double: 1 + +Function: "log10_vlen4": +double: 1 +float: 1 + +Function: "log10_vlen4_avx2": +double: 1 + +Function: "log10_vlen8": +double: 1 +float: 1 + +Function: "log10_vlen8_avx2": +float: 1 + Function: "log1p": double: 1 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core-sse2.S 
b/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core-sse2.S new file mode 100644 index 0000000000..e654db6d6c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized log10, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_log10 _ZGVbN2v_log10_sse2 +#include "../svml_d_log102_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core.c new file mode 100644 index 0000000000..1c775f33b6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log10, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN2v_log10 +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_log10, __GI__ZGVbN2v_log10, __redirect__ZGVbN2v_log10) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core_sse4.S new file mode 100644 index 0000000000..208608f622 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log102_core_sse4.S @@ -0,0 +1,1086 @@ +/* Function log10 vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log10(x) = k*log10(2.0) - log10(Rcp) + poly_approximation(R) + * log10(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog10_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 4112 +#define poly_coeff 8224 +#define ExpMask 8304 +#define Two10 8320 +#define MinNorm 8336 +#define MaxNorm 8352 +#define HalfMask 8368 +#define One 8384 +#define Threshold 8400 +#define Bias 8416 +#define Bias1 8432 +#define L2 8448 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_log10_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + +/* exponent bits */ + movaps %xmm0, %xmm5 + +/* preserve mantissa, set input exponent to 2^(-10) */ + movups ExpMask+__svml_dlog10_data_internal(%rip), %xmm1 + psrlq $20, %xmm5 + andps %xmm0, %xmm1 + lea -4222960+__svml_dlog10_data_internal(%rip), %rsi + orps Two10+__svml_dlog10_data_internal(%rip), %xmm1 + +/* check range */ + movaps %xmm0, %xmm8 + +/* reciprocal approximation good to at least 11 bits */ + cvtpd2ps %xmm1, %xmm2 + cmpltpd MinNorm+__svml_dlog10_data_internal(%rip), %xmm8 + movlhps %xmm2, %xmm2 + movaps %xmm0, %xmm7 + rcpps %xmm2, %xmm3 + cmpnlepd MaxNorm+__svml_dlog10_data_internal(%rip), %xmm7 + cvtps2pd %xmm3, %xmm12 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + movups .FLT_12(%rip), %xmm4 + orps %xmm7, %xmm8 + addpd %xmm4, %xmm12 + +/* combine and get argument value range mask */ + movmskpd %xmm8, %edx + +/* argument reduction */ + movups HalfMask+__svml_dlog10_data_internal(%rip), %xmm9 + subpd %xmm4, %xmm12 + andps %xmm1, %xmm9 + +/* + * prepare table index + * table lookup + */ + movaps %xmm12, %xmm10 + subpd %xmm9, %xmm1 + mulpd %xmm12, %xmm9 + mulpd %xmm12, %xmm1 + subpd One+__svml_dlog10_data_internal(%rip), %xmm9 + addpd %xmm9, 
%xmm1 + +/* polynomial */ + movups poly_coeff+__svml_dlog10_data_internal(%rip), %xmm14 + psrlq $40, %xmm10 + mulpd %xmm1, %xmm14 + movd %xmm10, %eax + pshufd $2, %xmm10, %xmm11 + movaps %xmm1, %xmm10 + movups poly_coeff+32+__svml_dlog10_data_internal(%rip), %xmm15 + mulpd %xmm1, %xmm10 + addpd poly_coeff+16+__svml_dlog10_data_internal(%rip), %xmm14 + mulpd %xmm1, %xmm15 + mulpd %xmm10, %xmm14 + addpd poly_coeff+48+__svml_dlog10_data_internal(%rip), %xmm15 + movd %xmm11, %ecx + +/* exponent*log(2.0) */ + movups Threshold+__svml_dlog10_data_internal(%rip), %xmm13 + addpd %xmm14, %xmm15 + cmpltpd %xmm12, %xmm13 + mulpd %xmm15, %xmm10 + pshufd $221, %xmm5, %xmm6 + movups poly_coeff+64+__svml_dlog10_data_internal(%rip), %xmm11 + +/* biased exponent in DP format */ + cvtdq2pd %xmm6, %xmm3 + mulpd %xmm1, %xmm11 + andps Bias+__svml_dlog10_data_internal(%rip), %xmm13 + orps Bias1+__svml_dlog10_data_internal(%rip), %xmm13 + subpd %xmm13, %xmm3 + addpd %xmm10, %xmm11 + mulpd L2+__svml_dlog10_data_internal(%rip), %xmm3 + movslq %eax, %rax + movslq %ecx, %rcx + movsd (%rsi,%rax), %xmm2 + movhpd (%rsi,%rcx), %xmm2 + +/* reconstruction */ + addpd %xmm11, %xmm2 + addpd %xmm2, %xmm3 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm3, %xmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm3, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq 
%r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm3 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm3 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call
log10@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_log10_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dlog10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Log_HA_table[(1<<9)+2][2]; + __declspec(align(16)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(16)) VUINT32 poly_coeff[5][2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 Two10[2][2]; + __declspec(align(16)) VUINT32 MinNorm[2][2]; + __declspec(align(16)) VUINT32 MaxNorm[2][2]; + __declspec(align(16)) VUINT32 HalfMask[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 Bias[2][2]; + __declspec(align(16)) VUINT32 Bias1[2][2]; + __declspec(align(16)) VUINT32 L2[2][2]; +} __svml_dlog10_data_internal; +#endif +__svml_dlog10_data_internal: + /* Log_HA_table */ + .quad 0xc0733a7146f6b080, 0xbe1e707ce619c200 + .quad 0xc0733a7547771970, 0xbe1e79c6c06d6f51 + .quad 0xc0733a7945aacb70, 0xbe1e78e225fad29c + .quad 0xc0733a7d41946970, 0xbe1e76d607f9693b + .quad 0xc0733a813b3691f0, 0xbe1e7704b3e0685b + .quad 0xc0733a853293df00, 0xbe1e79c1216a27fa + .quad 0xc0733a8927aee660, 0xbe1e76dce5734a81 + .quad 0xc0733a8d1a8a3920, 0xbe1e782ee2ca4dba + .quad 0xc0733a910b286430, 0xbe1e7812d1a0a61f + .quad 0xc0733a94f98bf010, 0xbe1e77e1b5ecbc61 + .quad 0xc0733a98e5b76100, 0xbe1e76635cac1586 + .quad 0xc0733a9ccfad36f0, 0xbe1e7638f7968f32 + .quad 0xc0733aa0b76feda0, 0xbe1e7840ee76e365 + .quad 0xc0733aa49d01fcb0, 0xbe1e79f3fd01907e + .quad 0xc0733aa88065d7a0, 0xbe1e77bbb3a9c38a + .quad 0xc0733aac619dedb0, 0xbe1e7742719bf41d + .quad 0xc0733ab040acaa20, 0xbe1e79bcedaf79cb + .quad 0xc0733ab41d947450, 0xbe1e762d63cb7ca0 + .quad 0xc0733ab7f857af50, 0xbe1e77a07be83403 + .quad 0xc0733abbd0f8ba80, 0xbe1e7763ff836ad0 + .quad 
0xc0733abfa779f130, 0xbe1e7737720ead39 + .quad 0xc0733ac37bddaad0, 0xbe1e7776a08e55e7 + .quad 0xc0733ac74e263af0, 0xbe1e793e3c52dd36 + .quad 0xc0733acb1e55f160, 0xbe1e788a94695051 + .quad 0xc0733aceec6f1a10, 0xbe1e76508114a813 + .quad 0xc0733ad2b873fd20, 0xbe1e76909457d23e + .quad 0xc0733ad68266df10, 0xbe1e7664a24f9ca4 + .quad 0xc0733ada4a4a0090, 0xbe1e7a07b3d44b18 + .quad 0xc0733ade101f9ee0, 0xbe1e76d87594704d + .quad 0xc0733ae1d3e9f340, 0xbe1e79563595a182 + .quad 0xc0733ae595ab33b0, 0xbe1e771880c3c6ab + .quad 0xc0733ae955659250, 0xbe1e78c171f517d4 + .quad 0xc0733aed131b3df0, 0xbe1e77eac3874666 + .quad 0xc0733af0cece61b0, 0xbe1e790db479d8f6 + .quad 0xc0733af488812550, 0xbe1e7965d1aa5c90 + .quad 0xc0733af84035ad10, 0xbe1e78ceb398ba47 + .quad 0xc0733afbf5ee19c0, 0xbe1e779cc0dcb5aa + .quad 0xc0733affa9ac88c0, 0xbe1e7871053953ed + .quad 0xc0733b035b731420, 0xbe1e7a082cffa71a + .quad 0xc0733b070b43d2a0, 0xbe1e7904b4382fad + .quad 0xc0733b0ab920d790, 0xbe1e79b458d0b4f3 + .quad 0xc0733b0e650c3310, 0xbe1e79d0ded414c6 + .quad 0xc0733b120f07f200, 0xbe1e763c357a1943 + .quad 0xc0733b15b7161dd0, 0xbe1e78b80ba6daaa + .quad 0xc0733b195d38bd00, 0xbe1e7998e23b8ffd + .quad 0xc0733b1d0171d2c0, 0xbe1e7974aa65ee8c + .quad 0xc0733b20a3c35f20, 0xbe1e76ccfde752ab + .quad 0xc0733b24442f5ef0, 0xbe1e77b4ff19debb + .quad 0xc0733b27e2b7cc10, 0xbe1e7772ee478542 + .quad 0xc0733b2b7f5e9d30, 0xbe1e781d81b58b44 + .quad 0xc0733b2f1a25c600, 0xbe1e78350d967565 + .quad 0xc0733b32b30f3720, 0xbe1e783888e48152 + .quad 0xc0733b364a1cde30, 0xbe1e78367bf7c111 + .quad 0xc0733b39df50a5d0, 0xbe1e7959e57ca47d + .quad 0xc0733b3d72ac75c0, 0xbe1e777322423222 + .quad 0xc0733b41043232b0, 0xbe1e767ce42a60aa + .quad 0xc0733b4493e3be70, 0xbe1e781d445aea19 + .quad 0xc0733b4821c2f800, 0xbe1e7922fca18e18 + .quad 0xc0733b4badd1bb80, 0xbe1e76fed3d40647 + .quad 0xc0733b4f3811e210, 0xbe1e793948c9eabc + .quad 0xc0733b52c0854240, 0xbe1e76e487656b8c + .quad 0xc0733b56472daf90, 0xbe1e780ab2f71223 + .quad 0xc0733b59cc0cfaf0, 
0xbe1e77189120b09c + .quad 0xc0733b5d4f24f270, 0xbe1e7644a0343a12 + .quad 0xc0733b60d0776160, 0xbe1e78f2a3e4733d + .quad 0xc0733b6450061080, 0xbe1e7913b2f73ae5 + .quad 0xc0733b67cdd2c5c0, 0xbe1e7882d08393b5 + .quad 0xc0733b6b49df4470, 0xbe1e765e1b209979 + .quad 0xc0733b6ec42d4d20, 0xbe1e785c9c4620d4 + .quad 0xc0733b75b394f240, 0xbe1e78878cd0e956 + .quad 0xc0733b7c9c178630, 0xbe1e789a4112d90b + .quad 0xc0733b837dc2b0f0, 0xbe1e79050b8a1766 + .quad 0xc0733b8a58a3f220, 0xbe1e7790dffc47aa + .quad 0xc0733b912cc8a180, 0xbe1e77174593b06a + .quad 0xc0733b97fa3defb0, 0xbe1e7677de2d2ecc + .quad 0xc0733b9ec110e6b0, 0xbe1e76cff477ca18 + .quad 0xc0733ba5814e6a80, 0xbe1e78f8644dec7b + .quad 0xc0733bac3b0339d0, 0xbe1e764e1361788d + .quad 0xc0733bb2ee3bee30, 0xbe1e78c913e738de + .quad 0xc0733bb99b04fd30, 0xbe1e76666f5bddaa + .quad 0xc0733bc0416ab850, 0xbe1e77e87cbd8ab6 + .quad 0xc0733bc6e1794e10, 0xbe1e76f18ba1c966 + .quad 0xc0733bcd7b3cca10, 0xbe1e777c9461b8db + .quad 0xc0733bd40ec115d0, 0xbe1e78b78526ffac + .quad 0xc0733bda9c11f920, 0xbe1e7942abecfede + .quad 0xc0733be1233b1aa0, 0xbe1e76d8a684fd8c + .quad 0xc0733be7a4480010, 0xbe1e79622b539ac9 + .quad 0xc0733bee1f440f30, 0xbe1e7978e7cc20ea + .quad 0xc0733bf4943a8de0, 0xbe1e765c9c9de825 + .quad 0xc0733bfb0336a290, 0xbe1e775d8b138ee2 + .quad 0xc0733c016c435500, 0xbe1e78bf33465c2f + .quad 0xc0733c07cf6b8e80, 0xbe1e78164f7cc441 + .quad 0xc0733c0e2cba1a50, 0xbe1e7824e64d0b23 + .quad 0xc0733c148439a630, 0xbe1e78373ae7dd81 + .quad 0xc0733c1ad5f4c2c0, 0xbe1e7704513e0afe + .quad 0xc0733c2121f5e3d0, 0xbe1e7914aa84200f + .quad 0xc0733c2768476110, 0xbe1e76b1cde25cf6 + .quad 0xc0733c2da8f37600, 0xbe1e796120e3862d + .quad 0xc0733c33e40442e0, 0xbe1e78ec836d7e7b + .quad 0xc0733c3a1983cca0, 0xbe1e77fb13b7dabb + .quad 0xc0733c40497bfd70, 0xbe1e783c6fcb2404 + .quad 0xc0733c4673f6a530, 0xbe1e7628bb93dce8 + .quad 0xc0733c4c98fd7990, 0xbe1e7857a47b5001 + .quad 0xc0733c52b89a16d0, 0xbe1e76708dc2831f + .quad 0xc0733c58d2d5ffa0, 0xbe1e77b6038651f1 + 
.quad 0xc0733c5ee7ba9de0, 0xbe1e792e855bb5b2 + .quad 0xc0733c64f75142d0, 0xbe1e776cacd5c105 + .quad 0xc0733c6b01a32740, 0xbe1e77f8a8011315 + .quad 0xc0733c7106b96c30, 0xbe1e765cf3efcfde + .quad 0xc0733c77069d1ad0, 0xbe1e78d837d2efac + .quad 0xc0733c7d01572530, 0xbe1e78b615cf772c + .quad 0xc0733c82f6f06640, 0xbe1e7650bbbd7a25 + .quad 0xc0733c88e771a220, 0xbe1e78bcf3495872 + .quad 0xc0733c8ed2e386c0, 0xbe1e792266832e84 + .quad 0xc0733c94b94eabd0, 0xbe1e79c1c3c2ca52 + .quad 0xc0733c9a9abb9340, 0xbe1e78aa61e5807d + .quad 0xc0733ca07732a970, 0xbe1e7620fc4cf156 + .quad 0xc0733ca64ebc4570, 0xbe1e76b914a832c5 + .quad 0xc0733cac2160a970, 0xbe1e79227f72020e + .quad 0xc0733cb1ef280300, 0xbe1e77ac972cc008 + .quad 0xc0733cb7b81a6b10, 0xbe1e798089be41f4 + .quad 0xc0733cbd7c3fe6a0, 0xbe1e77942ae037fe + .quad 0xc0733cc33ba06690, 0xbe1e7956ae6463d9 + .quad 0xc0733cc8f643c850, 0xbe1e7918a50c7942 + .quad 0xc0733cceac31d5d0, 0xbe1e78308eeab604 + .quad 0xc0733cd45d7245e0, 0xbe1e76dd4ea88445 + .quad 0xc0733cda0a0cbc60, 0xbe1e77e7c1aa5909 + .quad 0xc0733cdfb208caa0, 0xbe1e7804b9d20e54 + .quad 0xc0733ce5556def70, 0xbe1e78f88e99d49c + .quad 0xc0733ceaf4439780, 0xbe1e787d74682d68 + .quad 0xc0733cf08e911d80, 0xbe1e76edc24fe6e7 + .quad 0xc0733cf6245dca50, 0xbe1e79b347ec86d2 + .quad 0xc0733cfbb5b0d580, 0xbe1e797cceb2c39b + .quad 0xc0733d0142916530, 0xbe1e783adbdc6aa1 + .quad 0xc0733d06cb068e70, 0xbe1e76e4c20e3d9e + .quad 0xc0733d0c4f175570, 0xbe1e77070bf3cf61 + .quad 0xc0733d11cecaadc0, 0xbe1e781c43502734 + .quad 0xc0733d174a277a80, 0xbe1e78b11268ea72 + .quad 0xc0733d1cc1348e90, 0xbe1e7754b83bfc7d + .quad 0xc0733d2233f8acb0, 0xbe1e7756c29bf5e9 + .quad 0xc0733d27a27a87d0, 0xbe1e7952fc1d9333 + .quad 0xc0733d2d0cc0c350, 0xbe1e778c76ae6077 + .quad 0xc0733d3272d1f2e0, 0xbe1e7a1896ba8f43 + .quad 0xc0733d37d4b49b30, 0xbe1e76dafdf432d8 + .quad 0xc0733d3d326f3180, 0xbe1e795330184013 + .quad 0xc0733d428c081c80, 0xbe1e763cc774d30f + .quad 0xc0733d47e185b3d0, 0xbe1e77030a779c0a + .quad 0xc0733d4d32ee40b0, 
0xbe1e7908af2a2d7e + .quad 0xc0733d528047fe00, 0xbe1e78c4953b797d + .quad 0xc0733d57c9991850, 0xbe1e78b43b096579 + .quad 0xc0733d5d0ee7ae30, 0xbe1e7824ae0a4804 + .quad 0xc0733d625039d040, 0xbe1e79d2b2fbb740 + .quad 0xc0733d678d958190, 0xbe1e7662de59a1a6 + .quad 0xc0733d6cc700b760, 0xbe1e76b251d59aaa + .quad 0xc0733d71fc8159b0, 0xbe1e7a00cfd1f487 + .quad 0xc0733d772e1d4360, 0xbe1e77f4d246167e + .quad 0xc0733d7c5bda4200, 0xbe1e767a4ee8e6fc + .quad 0xc0733d8185be1640, 0xbe1e777ccf0a8aed + .quad 0xc0733d86abce7420, 0xbe1e767d7e279ada + .quad 0xc0733d8bce1102d0, 0xbe1e7a05cef4bb90 + .quad 0xc0733d90ec8b5d40, 0xbe1e78f75369be5b + .quad 0xc0733d96074311d0, 0xbe1e77b9612e8c8a + .quad 0xc0733d9b1e3da2b0, 0xbe1e794518b9adeb + .quad 0xc0733da031808620, 0xbe1e7810626fb934 + .quad 0xc0733da541112650, 0xbe1e76d87223fa6d + .quad 0xc0733daa4cf4e1a0, 0xbe1e794c5e7ca3b5 + .quad 0xc0733daf55310af0, 0xbe1e789856ef816f + .quad 0xc0733db459cae970, 0xbe1e77d2004effbd + .quad 0xc0733db95ac7b8f0, 0xbe1e78467d31eb9c + .quad 0xc0733dbe582caa00, 0xbe1e79aaa4e25787 + .quad 0xc0733dc351fee220, 0xbe1e762de8f107bf + .quad 0xc0733dc848437b90, 0xbe1e7670670a63fe + .quad 0xc0733dcd3aff85d0, 0xbe1e795ca237c6cc + .quad 0xc0733dd22a3805b0, 0xbe1e77e55c53c1d9 + .quad 0xc0733dd715f1f520, 0xbe1e78a806213ac4 + .quad 0xc0733ddbfe3243b0, 0xbe1e77743a2bc615 + .quad 0xc0733de0e2fdd660, 0xbe1e78b8b45b0b7d + .quad 0xc0733de5c4598800, 0xbe1e78d635f2f4b9 + .quad 0xc0733deaa24a2920, 0xbe1e7758c396a11e + .quad 0xc0733def7cd48020, 0xbe1e7a17a8cc454c + .quad 0xc0733df453fd49a0, 0xbe1e783caa73f616 + .quad 0xc0733df927c93820, 0xbe1e7932cfa29664 + .quad 0xc0733dfdf83cf490, 0xbe1e777d265c72a6 + .quad 0xc0733e02c55d1e10, 0xbe1e7775e7c03c60 + .quad 0xc0733e078f2e4a40, 0xbe1e79f65d52d232 + .quad 0xc0733e0c55b50570, 0xbe1e76e7e7464b4e + .quad 0xc0733e1118f5d250, 0xbe1e77be81cad877 + .quad 0xc0733e15d8f52a80, 0xbe1e79dd25b5fb3a + .quad 0xc0733e1a95b77e80, 0xbe1e78e45f1418ef + .quad 0xc0733e1f4f4135a0, 0xbe1e78eb7289505b + 
.quad 0xc0733e240596ae50, 0xbe1e78a468c07cad + .quad 0xc0733e28b8bc3e20, 0xbe1e776b558a4009 + .quad 0xc0733e2d68b631d0, 0xbe1e77412eb9941e + .quad 0xc0733e321588cd80, 0xbe1e76b2853f845e + .quad 0xc0733e36bf384cb0, 0xbe1e76aa7184273c + .quad 0xc0733e3b65c8e260, 0xbe1e7832027f78fa + .quad 0xc0733e40093eb930, 0xbe1e7a1c7da131f5 + .quad 0xc0733e44a99df380, 0xbe1e76a0bc2ae4bc + .quad 0xc0733e4946eaab30, 0xbe1e78dff13b6f5d + .quad 0xc0733e4de128f250, 0xbe1e765a226dea2c + .quad 0xc0733e52785cd290, 0xbe1e78509b989111 + .quad 0xc0733e570c8a4de0, 0xbe1e7916a4e9803d + .quad 0xc0733e5b9db55e30, 0xbe1e7950c15758cc + .quad 0xc0733e602be1f5a0, 0xbe1e7922ba1ad420 + .quad 0xc0733e64b713fe90, 0xbe1e794cbaabcef6 + .quad 0xc0733e693f4f5bc0, 0xbe1e7837bf883fed + .quad 0xc0733e6dc497e850, 0xbe1e76f198ddbbdf + .quad 0xc0733e7246f177d0, 0xbe1e7a18c1067764 + .quad 0xc0733e76c65fd6a0, 0xbe1e76b845a8fd9d + .quad 0xc0733e7b42e6c970, 0xbe1e7714012df506 + .quad 0xc0733e7fbc8a0de0, 0xbe1e7765612922cd + .quad 0xc0733e84334d5a50, 0xbe1e7688f5424a00 + .quad 0xc0733e88a7345df0, 0xbe1e769d011f6663 + .quad 0xc0733e8d1842c0e0, 0xbe1e79914acbfaf7 + .quad 0xc0733e91867c2460, 0xbe1e79a85e189bd7 + .quad 0xc0733e95f1e422a0, 0xbe1e79ea7c726432 + .quad 0xc0733e9a5a7e4f10, 0xbe1e768a6fbb8e6e + .quad 0xc0733e9ec04e3620, 0xbe1e793c75bcc9fc + .quad 0xc0733ea323575dd0, 0xbe1e797f78da13d4 + .quad 0xc0733ea7839d4550, 0xbe1e78d8c9cda978 + .quad 0xc0733eabe1236540, 0xbe1e77028d480fff + .quad 0xc0733eb03bed2fa0, 0xbe1e7a0d0f74ff7c + .quad 0xc0733eb493fe1040, 0xbe1e76732e8a35fb + .quad 0xc0733eb8e9596c30, 0xbe1e77220caeabeb + .quad 0xc0733ebd3c02a260, 0xbe1e797438b645ef + .quad 0xc0733ec18bfd0b80, 0xbe1e79207c5fd6e8 + .quad 0xc0733ec5d94bf9f0, 0xbe1e781c7df8f946 + .quad 0xc0733eca23f2b9f0, 0xbe1e76736284e2db + .quad 0xc0733ece6bf49190, 0xbe1e7a109cc0c3f5 + .quad 0xc0733ed2b154c120, 0xbe1e767f14a16d50 + .quad 0xc0733ed6f4168290, 0xbe1e789cd22acaf0 + .quad 0xc0733edb343d0a40, 0xbe1e764355ca28ad + .quad 0xc0733edf71cb8660, 
0xbe1e79e4c7a81c45 + .quad 0xc0733ee3acc51fb0, 0xbe1e761e26b644c2 + .quad 0xc0733ee7e52cf8c0, 0xbe1e793e9f8fbdd3 + .quad 0xc0733eec1b062ed0, 0xbe1e78c432991c20 + .quad 0xc0733ef04e53d940, 0xbe1e78cdd025f4d8 + .quad 0xc0733ef47f1909f0, 0xbe1e778310c6446e + .quad 0xc0733ef8ad58cd20, 0xbe1e7871af3d6e17 + .quad 0xc0733efcd91629b0, 0xbe1e77e0e906f697 + .quad 0xc0733f01025420f0, 0xbe1e7a1ae9b27892 + .quad 0xc0733f052915af00, 0xbe1e76ac64c88f9d + .quad 0xc0733f094d5dca60, 0xbe1e779a815589c4 + .quad 0xc0733f0d6f2f6480, 0xbe1e788f39a4864c + .quad 0xc0733f118e8d6980, 0xbe1e79fc51263525 + .quad 0xc0733f15ab7ac060, 0xbe1e783501f19e90 + .quad 0xc0733f19c5fa4ae0, 0xbe1e767e82c327ab + .quad 0xc0733f1dde0ee5a0, 0xbe1e7a1785d66123 + .quad 0xc0733f21f3bb6870, 0xbe1e7936d07203da + .quad 0xc0733f260702a5e0, 0xbe1e7a010a7ac699 + .quad 0xc0733f2a17e76bb0, 0xbe1e7975e4e16312 + .quad 0xc0733f2e266c82b0, 0xbe1e7654b5422330 + .quad 0xc0733f323294aeb0, 0xbe1e77f8a4909d35 + .quad 0xc0733f363c62aee0, 0xbe1e792c8e30d226 + .quad 0xc0733f3a43d93da0, 0xbe1e76f6ac67a1ff + .quad 0xc0733f3e48fb1070, 0xbe1e775c2e97715a + .quad 0xc0733f424bcad840, 0xbe1e781cd54ae100 + /*== Log_LA_table ==*/ + .align 16 + .quad 0x0000000000000000 + .quad 0xbf4bc48a867884b7 + .quad 0xbf5bbd9e9482af09 + .quad 0xbf64c9096b94befd + .quad 0xbf6bafd47221ed26 + .quad 0xbf714999e2ad8ea6 + .quad 0xbf74b99563d2a1bd + .quad 0xbf7827de6b310350 + .quad 0xbf7b9476a4fcd10f + .quad 0xbf7eff5fbaf25781 + .quad 0xbf81344daa2d7553 + .quad 0xbf82e8158b08d957 + .quad 0xbf849b0851443684 + .quad 0xbf864d26cce610dd + .quad 0xbf87fe71ccc4e6b0 + .quad 0xbf89aeea1e897fdf + .quad 0xbf8b5e908eb13790 + .quad 0xbf8d0d65e890405a + .quad 0xbf8ebb6af653e2ee + .quad 0xbf90345040825bad + .quad 0xbf910a83a8446c78 + .quad 0xbf91e05015d30a71 + .quad 0xbf92b5b5ec0209d3 + .quad 0xbf938ab58d173e91 + .quad 0xbf945f4f5acb8be0 + .quad 0xbf953383b64bf13f + .quad 0xbf960753003a94ef + .quad 0xbf96dabd98afcc05 + .quad 0xbf97adc3df3b1ff8 + .quad 0xbf98806632e451d0 + 
.quad 0xbf9952a4f22c5ae9 + .quad 0xbf9a24807b0e6b5c + .quad 0xbf9af5f92b00e610 + .quad 0xbf9bc70f5ef65a77 + .quad 0xbf9c97c3735e7c0a + .quad 0xbf9d6815c4271775 + .quad 0xbf9e3806acbd058f + .quad 0xbf9f0796880d1c19 + .quad 0xbf9fd6c5b0851c4c + .quad 0xbfa052ca400a4f9b + .quad 0xbfa0ba01a8170000 + .quad 0xbfa121093ce3a205 + .quad 0xbfa187e12aad8077 + .quad 0xbfa1ee899d74a03e + .quad 0xbfa25502c0fc314c + .quad 0xbfa2bb4cc0cafe8d + .quad 0xbfa32167c82bdcda + .quad 0xbfa38754022e18e2 + .quad 0xbfa3ed1199a5e425 + .quad 0xbfa452a0b92cc0ec + .quad 0xbfa4b8018b21ed4f + .quad 0xbfa51d3439aacd4a + .quad 0xbfa58238eeb353da + .quad 0xbfa5e70fd3ee6b34 + .quad 0xbfa64bb912d65c07 + .quad 0xbfa6b034d4ad33df + .quad 0xbfa71483427d2a99 + .quad 0xbfa778a4851906f3 + .quad 0xbfa7dc98c51c8242 + .quad 0xbfa840602aecab3d + .quad 0xbfa8a3fadeb847f4 + .quad 0xbfa90769087836e4 + .quad 0xbfa96aaacfefcf3c + .quad 0xbfa9cdc05cad4042 + .quad 0xbfaa30a9d609efea + .quad 0xbfaa9367632ad897 + .quad 0xbfaaf5f92b00e610 + .quad 0xbfab585f544951a4 + .quad 0xbfabba9a058dfd84 + .quad 0xbfac1ca96525cf56 + .quad 0xbfac7e8d993509f9 + .quad 0xbface046c7ada68d + .quad 0xbfad41d5164facb4 + .quad 0xbfada338aaa98a0c + .quad 0xbfae0471aa1868f5 + .quad 0xbfae658039c88690 + .quad 0xbfaec6647eb58808 + .quad 0xbfaf271e9daacf20 + .quad 0xbfaf87aebb43ce06 + .quad 0xbfafe814fbec5a77 + .quad 0xbfb02428c1f08016 + .quad 0xbfb054323b97a948 + .quad 0xbfb08426fcdb1ee7 + .quad 0xbfb0b40717932b96 + .quad 0xbfb0e3d29d81165e + .quad 0xbfb11389a04f4a2e + .quad 0xbfb1432c31917d08 + .quad 0xbfb172ba62c4d6de + .quad 0xbfb1a23445501816 + .quad 0xbfb1d199ea83bfbe + .quad 0xbfb200eb639a3173 + .quad 0xbfb23028c1b7daed + .quad 0xbfb25f5215eb594a + .quad 0xbfb28e67712d9dfc + .quad 0xbfb2bd68e4621371 + .quad 0xbfb2ec568056c16f + .quad 0xbfb31b3055c47118 + .quad 0xbfb349f6754ed0b4 + .quad 0xbfb378a8ef84971e + .quad 0xbfb3a747d4dfa6f5 + .quad 0xbfb3d5d335c53179 + .quad 0xbfb4044b2285d925 + .quad 0xbfb432afab5dd3ff + .quad 0xbfb46100e0750da1 + 
.quad 0xbfb48f3ed1df48fb + .quad 0xbfb4bd698f9c41cf + .quad 0xbfb4eb812997cde4 + .quad 0xbfb51985afa9fdfd + .quad 0xbfb5477731973e85 + .quad 0xbfb57555bf1077f5 + .quad 0xbfb5a32167b32f02 + .quad 0xbfb5d0da3b09a47e + .quad 0xbfb5fe80488af4fd + .quad 0xbfb62c139f9b3837 + .quad 0xbfb659944f8ba02d + .quad 0xbfb68702679a980a + .quad 0xbfb6b45df6f3e2c9 + .quad 0xbfb6e1a70cb0b99a + .quad 0xbfb70eddb7d7ea07 + .quad 0xbfb73c02075df3e5 + .quad 0xbfb769140a2526fd + .quad 0xbfb79613cefdc07d + .quad 0xbfb7c30164a60836 + .quad 0xbfb7efdcd9ca6d8f + .quad 0xbfb81ca63d05a44a + .quad 0xbfb8495d9ce0c10c + .quad 0xbfb8760307d355ab + .quad 0xbfb8a2968c438d41 + .quad 0xbfb8cf183886480d + .quad 0xbfb8fb881adf3713 + .quad 0xbfb927e64180f790 + .quad 0xbfb95432ba8d2e2f + .quad 0xbfb9806d9414a209 + .quad 0xbfb9ac96dc175776 + .quad 0xbfb9d8aea084aa9c + .quad 0xbfba04b4ef3b69d8 + .quad 0xbfba30a9d609efea + .quad 0xbfba5c8d62ae3dec + .quad 0xbfba885fa2d6151e + .quad 0xbfbab420a41f1076 + .quad 0xbfbadfd07416be07 + .quad 0xbfbb0b6f203ab82c + .quad 0xbfbb36fcb5f8be8a + .quad 0xbfbb627942aecedd + .quad 0xbfbb8de4d3ab3d98 + .quad 0xbfbbb93f762cce4f + .quad 0xbfbbe4893762cbf7 + .quad 0xbfbc0fc2246d20f5 + .quad 0xbfbc3aea4a5c6eff + .quad 0xbfbc6601b63226cb + .quad 0xbfbc910874e09f98 + .quad 0xbfbcbbfe934b2e81 + .quad 0xbfbce6e41e463da5 + .quad 0xbfbd11b92297632b + .quad 0xbfbd3c7dacf5780b + .quad 0xbfbd6731ca08aeb9 + .quad 0xbfbd91d5866aa99c + .quad 0xbfbdbc68eea6915b + .quad 0xbfbde6ec0f392b05 + .quad 0xbfbe115ef490ee07 + .quad 0xbfbe3bc1ab0e19fe + .quad 0xbfbe66143f02cc5d + .quad 0xbfbe9056bcb315e8 + .quad 0xbfbeba893055100b + .quad 0xbfbee4aba610f204 + .quad 0xbfbf0ebe2a0125eb + .quad 0xbfbf38c0c8325d86 + .quad 0xbfbf62b38ca3a706 + .quad 0xbfbf8c9683468191 + .quad 0xbfbfb669b7fef1a8 + .quad 0xbfbfe02d36a3956d + .quad 0xbfc004f0857edc5c + .quad 0xbfc019c2a064b486 + .quad 0xbfc02e8cf1dac4b8 + .quad 0xbfc0434f7fb1f307 + .quad 0xbfc0580a4fb4a3df + .quad 0xbfc06cbd67a6c3b6 + .quad 0xbfc08168cd45d0a9 + 
.quad 0xbfc0960c8648e406 + .quad 0xbfc0aaa89860bbcf + .quad 0xbfc0bf3d0937c41c + .quad 0xbfc0d3c9de722078 + .quad 0xbfc0e84f1dadb526 + .quad 0xbfc0fccccc823059 + .quad 0xbfc11142f0811357 + .quad 0xbfc125b18f35bb8e + .quad 0xbfc13a18ae256b99 + .quad 0xbfc14e7852cf5430 + .quad 0xbfc162d082ac9d10 + .quad 0xbfc1772143306dc6 + .quad 0xbfc18b6a99c7f679 + .quad 0xbfc19fac8bda7897 + .quad 0xbfc1b3e71ec94f7b + .quad 0xbfc1c81a57eff8fd + .quad 0xbfc1dc463ca41df8 + .quad 0xbfc1f06ad2359abd + .quad 0xbfc204881dee8777 + .quad 0xbfc2189e25134081 + .quad 0xbfc22cacece26ead + .quad 0xbfc240b47a950f79 + .quad 0xbfc254b4d35e7d3c + .quad 0xbfc268adfc6c773e + .quad 0xbfc27c9ffae729c1 + .quad 0xbfc2908ad3f13603 + .quad 0xbfc2a46e8ca7ba2a + .quad 0xbfc2b84b2a225923 + .quad 0xbfc2cc20b1734279 + .quad 0xbfc2dfef27a73a18 + .quad 0xbfc2f3b691c5a001 + .quad 0xbfc30776f4d077f7 + .quad 0xbfc31b3055c47118 + .quad 0xbfc32ee2b998ed6e + .quad 0xbfc3428e2540096d + .quad 0x3fc331f403985097 + .quad 0x3fc31e56798a910a + .quad 0x3fc30abfd8f333b6 + .quad 0x3fc2f7301cf4e87b + .quad 0x3fc2e3a740b7800f + .quad 0x3fc2d0253f67e4cb + .quad 0x3fc2bcaa14381386 + .quad 0x3fc2a935ba5f1479 + .quad 0x3fc295c82d18f434 + .quad 0x3fc2826167a6bc9c + .quad 0x3fc26f01654e6df6 + .quad 0x3fc25ba8215af7fc + .quad 0x3fc24855971c3307 + .quad 0x3fc23509c1e6d937 + .quad 0x3fc221c49d147fb3 + .quad 0x3fc20e8624038fed + .quad 0x3fc1fb4e521740f4 + .quad 0x3fc1e81d22b790d4 + .quad 0x3fc1d4f291513e01 + .quad 0x3fc1c1ce9955c0c6 + .quad 0x3fc1aeb1363b44c8 + .quad 0x3fc19b9a637ca295 + .quad 0x3fc1888a1c995931 + .quad 0x3fc175805d1587c1 + .quad 0x3fc1627d2079e731 + .quad 0x3fc14f806253c3ed + .quad 0x3fc13c8a1e34f7a0 + .quad 0x3fc1299a4fb3e306 + .quad 0x3fc116b0f26b67bb + .quad 0x3fc103ce01fae223 + .quad 0x3fc0f0f17a062353 + .quad 0x3fc0de1b56356b04 + .quad 0x3fc0cb4b9235619a + .quad 0x3fc0b88229b71227 + .quad 0x3fc0a5bf186fe483 + .quad 0x3fc093025a19976c + .quad 0x3fc0804bea723aa9 + .quad 0x3fc06d9bc53c2941 + .quad 0x3fc05af1e63e03b4 + 
.quad 0x3fc0484e4942aa43 + .quad 0x3fc035b0ea19373b + .quad 0x3fc02319c494f951 + .quad 0x3fc01088d48d6e03 + .quad 0x3fbffbfc2bbc7803 + .quad 0x3fbfd6f308ce5b52 + .quad 0x3fbfb1f6381856f4 + .quad 0x3fbf8d05b16a6d47 + .quad 0x3fbf68216c9cc727 + .quad 0x3fbf4349618fa91a + .quad 0x3fbf1e7d882b689a + .quad 0x3fbef9bdd860616b + .quad 0x3fbed50a4a26eafc + .quad 0x3fbeb062d57f4de8 + .quad 0x3fbe8bc77271b97a + .quad 0x3fbe6738190e394c + .quad 0x3fbe42b4c16caaf3 + .quad 0x3fbe1e3d63acb3ba + .quad 0x3fbdf9d1f7f5b674 + .quad 0x3fbdd5727676c959 + .quad 0x3fbdb11ed766abf4 + .quad 0x3fbd8cd71303bd26 + .quad 0x3fbd689b2193f133 + .quad 0x3fbd446afb64c7e5 + .quad 0x3fbd204698cb42bd + .quad 0x3fbcfc2df223db2d + .quad 0x3fbcd820ffd278f3 + .quad 0x3fbcb41fba42686d + .quad 0x3fbc902a19e65111 + .quad 0x3fbc6c4017382bea + .quad 0x3fbc4861aab93a23 + .quad 0x3fbc248eccf1fba6 + .quad 0x3fbc00c7767225cb + .quad 0x3fbbdd0b9fd09a10 + .quad 0x3fbbb95b41ab5ce6 + .quad 0x3fbb95b654a78c87 + .quad 0x3fbb721cd17157e3 + .quad 0x3fbb4e8eb0bbf58f + .quad 0x3fbb2b0beb419ad0 + .quad 0x3fbb079479c372ad + .quad 0x3fbae4285509950b + .quad 0x3fbac0c775e2fde6 + .quad 0x3fba9d71d5258484 + .quad 0x3fba7a276badd2c8 + .quad 0x3fba56e8325f5c87 + .quad 0x3fba33b4222456f1 + .quad 0x3fba108b33edb005 + .quad 0x3fb9ed6d60b30612 + .quad 0x3fb9ca5aa1729f45 + .quad 0x3fb9a752ef316149 + .quad 0x3fb9845642fac8f0 + .quad 0x3fb9616495e0e1e8 + .quad 0x3fb93e7de0fc3e80 + .quad 0x3fb91ba21d6bef77 + .quad 0x3fb8f8d144557bdf + .quad 0x3fb8d60b4ee4d901 + .quad 0x3fb8b350364c6257 + .quad 0x3fb8909ff3c4d191 + .quad 0x3fb86dfa808d36a0 + .quad 0x3fb84b5fd5eaefd8 + .quad 0x3fb828cfed29a215 + .quad 0x3fb8064abf9b30f1 + .quad 0x3fb7e3d04697b704 + .quad 0x3fb7c1607b7d7e32 + .quad 0x3fb79efb57b0f803 + .quad 0x3fb77ca0d49cb608 + .quad 0x3fb75a50ebb1624a + .quad 0x3fb7380b9665b7c8 + .quad 0x3fb715d0ce367afc + .quad 0x3fb6f3a08ca67270 + .quad 0x3fb6d17acb3e5f5e + .quad 0x3fb6af5f838cf654 + .quad 0x3fb68d4eaf26d7ee + .quad 0x3fb66b4847a68997 + 
.quad 0x3fb6494c46ac6e4d + .quad 0x3fb6275aa5debf81 + .quad 0x3fb605735ee985f1 + .quad 0x3fb5e3966b7e9295 + .quad 0x3fb5c1c3c5557799 + .quad 0x3fb59ffb662b815c + .quad 0x3fb57e3d47c3af7b + .quad 0x3fb55c8963e6adeb + .quad 0x3fb53adfb462ce16 + .quad 0x3fb51940330c000b + .quad 0x3fb4f7aad9bbcbaf + .quad 0x3fb4d61fa2514a00 + .quad 0x3fb4b49e86b11e5f + .quad 0x3fb4932780c56fe2 + .quad 0x3fb471ba8a7de2b7 + .quad 0x3fb450579dcf9186 + .quad 0x3fb42efeb4b506e9 + .quad 0x3fb40dafc92e36e2 + .quad 0x3fb3ec6ad5407868 + .quad 0x3fb3cb2fd2f67ef1 + .quad 0x3fb3a9febc60540a + .quad 0x3fb388d78b9350ff + .quad 0x3fb367ba3aaa1883 + .quad 0x3fb346a6c3c49066 + .quad 0x3fb3259d2107db54 + .quad 0x3fb3049d4c9e52a0 + .quad 0x3fb2e3a740b7800f + .quad 0x3fb2c2baf78817b7 + .quad 0x3fb2a1d86b49f1e2 + .quad 0x3fb280ff963c04fc + .quad 0x3fb2603072a25f82 + .quad 0x3fb23f6afac6220a + .quad 0x3fb21eaf28f57941 + .quad 0x3fb1fdfcf7839804 + .quad 0x3fb1dd5460c8b16f + .quad 0x3fb1bcb55f21f307 + .quad 0x3fb19c1fecf17ee0 + .quad 0x3fb17b94049e65d0 + .quad 0x3fb15b11a094a1aa + .quad 0x3fb13a98bb450f81 + .quad 0x3fb11a294f2569f6 + .quad 0x3fb0f9c356b04389 + .quad 0x3fb0d966cc6500fa + .quad 0x3fb0b913aac7d3a7 + .quad 0x3fb098c9ec61b3ff + .quad 0x3fb078898bc05bf4 + .quad 0x3fb0585283764178 + .quad 0x3fb03824ce1a9101 + .quad 0x3fb0180066492817 + .quad 0x3fafefca8d451fd6 + .quad 0x3fafafa6d397efdb + .quad 0x3faf6f9594de60f0 + .quad 0x3faf2f96c6754aee + .quad 0x3faeefaa5dc2b239 + .quad 0x3faeafd05035bd3b + .quad 0x3fae70089346a9e6 + .quad 0x3fae30531c76c34a + .quad 0x3fadf0afe1505738 + .quad 0x3fadb11ed766abf4 + .quad 0x3fad719ff455f5f7 + .quad 0x3fad32332dc34dbd + .quad 0x3facf2d8795ca5a5 + .quad 0x3facb38fccd8bfdb + .quad 0x3fac74591df72456 + .quad 0x3fac3534628016dd + .quad 0x3fabf62190448d22 + .quad 0x3fabb7209d1e24e5 + .quad 0x3fab78317eef1a29 + .quad 0x3fab39542ba23d73 + .quad 0x3faafa88992aea19 + .quad 0x3faabbcebd84fca0 + .quad 0x3faa7d268eb4c924 + .quad 0x3faa3e9002c711d2 + .quad 0x3faa000b0fd0fd6b + 
.quad 0x3fa9c197abf00dd7 + .quad 0x3fa98335cd4a16c3 + .quad 0x3fa944e56a0d3450 + .quad 0x3fa906a6786fc1cb + .quad 0x3fa8c878eeb05074 + .quad 0x3fa88a5cc3159e53 + .quad 0x3fa84c51ebee8d15 + .quad 0x3fa80e585f9218fc + .quad 0x3fa7d070145f4fd7 + .quad 0x3fa7929900bd4809 + .quad 0x3fa754d31b1b179c + .quad 0x3fa7171e59efcb5f + .quad 0x3fa6d97ab3ba5e10 + .quad 0x3fa69be81f01af99 + .quad 0x3fa65e6692547c4e + .quad 0x3fa620f604495440 + .quad 0x3fa5e3966b7e9295 + .quad 0x3fa5a647be9a54f6 + .quad 0x3fa56909f44a72fe + .quad 0x3fa52bdd034475b8 + .quad 0x3fa4eec0e2458f30 + .quad 0x3fa4b1b588129203 + .quad 0x3fa474baeb77e904 + .quad 0x3fa437d103498eec + .quad 0x3fa3faf7c663060e + .quad 0x3fa3be2f2ba7501f + .quad 0x3fa381772a00e604 + .quad 0x3fa344cfb861afae + .quad 0x3fa30838cdc2fbfd + .quad 0x3fa2cbb2612578b4 + .quad 0x3fa28f3c69912a74 + .quad 0x3fa252d6de1564c1 + .quad 0x3fa21681b5c8c213 + .quad 0x3fa1da3ce7c91bf8 + .quad 0x3fa19e086b3b8333 + .quad 0x3fa161e4374c37f4 + .quad 0x3fa125d0432ea20e + .quad 0x3fa0e9cc861d4944 + .quad 0x3fa0add8f759cd95 + .quad 0x3fa071f58e2cdf9b + .quad 0x3fa0362241e638ec + .quad 0x3f9ff4be13b92920 + .quad 0x3f9f7d57badb4ee8 + .quad 0x3f9f061167fc31e8 + .quad 0x3f9e8eeb09f2f6cb + .quad 0x3f9e17e48fa48962 + .quad 0x3f9da0fde8038de9 + .quad 0x3f9d2a3702105259 + .quad 0x3f9cb38fccd8bfdb + .quad 0x3f9c3d0837784c41 + .quad 0x3f9bc6a03117eb97 + .quad 0x3f9b5057a8ee01ce + .quad 0x3f9ada2e8e3e546f + .quad 0x3f9a6424d059fc68 + .quad 0x3f99ee3a5e9f57e8 + .quad 0x3f99786f2879fc53 + .quad 0x3f9902c31d62a843 + .quad 0x3f988d362cdf359e + .quad 0x3f9817c846828bbd + .quad 0x3f97a27959ec91aa + .quad 0x3f972d4956ca2067 + .quad 0x3f96b8382cd4f551 + .quad 0x3f964345cbd3a491 + .quad 0x3f95ce7223998b98 + .quad 0x3f9559bd2406c3ba + .quad 0x3f94e526bd0814d1 + .quad 0x3f9470aede96e7f2 + .quad 0x3f93fc5578b93a38 + .quad 0x3f93881a7b818f9e + .quad 0x3f9313fdd70ee5e8 + .quad 0x3f929fff7b8ca79d + .quad 0x3f922c1f59329f1b + .quad 0x3f91b85d6044e9ae + .quad 0x3f9144b98113eac0 + 
.quad 0x3f90d133abfc3f1b + .quad 0x3f905dcbd166b033 + .quad 0x3f8fd503c3904f1d + .quad 0x3f8eeeab9b43445d + .quad 0x3f8e088f0b004827 + .quad 0x3f8d22adf3f9579d + .quad 0x3f8c3d0837784c41 + .quad 0x3f8b579db6dec358 + .quad 0x3f8a726e53a6056e + .quad 0x3f898d79ef5eedf0 + .quad 0x3f88a8c06bb1d2f4 + .quad 0x3f87c441aa5e6d15 + .quad 0x3f86dffd8d3bbf70 + .quad 0x3f85fbf3f637ffc5 + .quad 0x3f851824c7587eb0 + .quad 0x3f84348fe2b99002 + .quad 0x3f8351352a8e733f + .quad 0x3f826e1481213c2e + .quad 0x3f818b2dc8d2bb91 + .quad 0x3f80a880e41a67f6 + .quad 0x3f7f8c1b6b0c8d4e + .quad 0x3f7dc7a83f75a96d + .quad 0x3f7c03a80ae5e054 + .quad 0x3f7a401a92ff827e + .quad 0x3f787cff9d9147a5 + .quad 0x3f76ba56f09621bc + .quad 0x3f74f8205235102d + .quad 0x3f73365b88c0f347 + .quad 0x3f7175085ab85ff0 + .quad 0x3f6f684d1d8ae702 + .quad 0x3f6be76bd77b4fc3 + .quad 0x3f68676c71434fb9 + .quad 0x3f64e84e793a474a + .quad 0x3f616a117e0d4b30 + .quad 0x3f5bd96a1d7d9cbc + .quad 0x3f54e071754c98ba + .quad 0x3f4bd27045bfd025 + .quad 0x3f3bcef518e29612 + .quad 0x8000000000000000 + /*== poly_coeff[5] ==*/ + .align 16 + .quad 0x3fb63C65231FBD16, 0x3fb63C65231FBD16 /* coeff5 */ + .quad 0xbfbBCB7D4EFBE80B, 0xbfbBCB7D4EFBE80B /* coeff4 */ + .quad 0x3fc287A7636F341E, 0x3fc287A7636F341E /* coeff3 */ + .quad 0xbfcBCB7B1526DE36, 0xbfcBCB7B1526DE36 /* coeff2 */ + .quad 0x3fdBCB7B1526E50E, 0x3fdBCB7B1526E50E /* coeff1 */ + /*== ExpMask ==*/ + .align 16 + .quad 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 16 + .quad 0x3f50000000000000, 0x3f50000000000000 + /*== MinNorm ==*/ + .align 16 + .quad 0x0010000000000000, 0x0010000000000000 + /*== MaxNorm ==*/ + .align 16 + .quad 0x7fefffffffffffff, 0x7fefffffffffffff + /*== HalfMask ==*/ + .align 16 + .quad 0xfffffffffc000000, 0xfffffffffc000000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== Threshold ==*/ + .align 16 + .quad 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 16 + .quad 0x408ff80000000000, 
0x408ff80000000000 + /*== Bias1 ==*/ + .align 16 + .quad 0x408ff00000000000, 0x408ff00000000000 + /*== L2 ==*/ + .align 16 + .quad 0x3fd34413509f79ff, 0x3fd34413509f79ff + .align 16 + .type __svml_dlog10_data_internal,@object + .size __svml_dlog10_data_internal,.-__svml_dlog10_data_internal + .space 48, 0x00 + .align 16 + +.FLT_12: + .long 0x00000000,0x43380000,0x00000000,0x43380000 + .type .FLT_12,@object + .size .FLT_12,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core-sse.S new file mode 100644 index 0000000000..0a101666f5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized log10, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN4v_log10 _ZGVdN4v_log10_sse_wrapper +#include "../svml_d_log104_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core.c new file mode 100644 index 0000000000..48c63cfb3d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log10, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN4v_log10 +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_log10, __GI__ZGVdN4v_log10, __redirect__ZGVdN4v_log10) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core_avx2.S new file mode 100644 index 0000000000..c77bfff32d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log104_core_avx2.S @@ -0,0 +1,1071 @@ +/* Function log10 vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log10(x) = k*log10(2.0) - log10(Rcp) + poly_approximation(R) + * log10(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog10_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 4128 +#define poly_coeff 8256 +#define ExpMask 8416 +#define Two10 8448 +#define MinNorm 8480 +#define MaxNorm 8512 +#define HalfMask 8544 +#define One 8576 +#define Threshold 8608 +#define Bias 8640 +#define Bias1 8672 +#define L2 8704 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_log10_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea -4222944+__svml_dlog10_data_internal(%rip), %r8 + vmovapd %ymm0, %ymm3 + +/* preserve mantissa, set input exponent to 2^(-10) */ + vandpd ExpMask+__svml_dlog10_data_internal(%rip), %ymm3, %ymm4 + vorpd Two10+__svml_dlog10_data_internal(%rip), %ymm4, %ymm2 + +/* reciprocal approximation good to at least 11 bits */ + vcvtpd2ps %ymm2, %xmm5 + +/* exponent bits */ + vpsrlq $20, %ymm3, %ymm7 + vmovupd One+__svml_dlog10_data_internal(%rip), %ymm14 + vrcpps %xmm5, %xmm6 + +/* check range */ + vcmplt_oqpd MinNorm+__svml_dlog10_data_internal(%rip), %ymm3, %ymm11 + vcmpnle_uqpd MaxNorm+__svml_dlog10_data_internal(%rip), %ymm3, %ymm12 + vcvtps2pd %xmm6, %ymm9 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + vroundpd $0, %ymm9, %ymm1 + +/* exponent*log(2.0) */ + vmovupd Threshold+__svml_dlog10_data_internal(%rip), %ymm9 + +/* + * prepare table index + * table lookup + */ + vpsrlq $40, %ymm1, %ymm15 + +/* argument reduction */ + vfmsub213pd %ymm14, %ymm1, %ymm2 + vcmplt_oqpd %ymm1, %ymm9, %ymm1 + vorpd %ymm12, %ymm11, 
%ymm13 + vmovupd poly_coeff+64+__svml_dlog10_data_internal(%rip), %ymm12 + vfmadd213pd poly_coeff+96+__svml_dlog10_data_internal(%rip), %ymm2, %ymm12 + +/* combine and get argument value range mask */ + vmovmskpd %ymm13, %eax + vmulpd %ymm2, %ymm2, %ymm13 + vextractf128 $1, %ymm7, %xmm8 + vshufps $221, %xmm8, %xmm7, %xmm10 + +/* biased exponent in DP format */ + vcvtdq2pd %xmm10, %ymm0 + vandpd Bias+__svml_dlog10_data_internal(%rip), %ymm1, %ymm10 + vorpd Bias1+__svml_dlog10_data_internal(%rip), %ymm10, %ymm11 + vsubpd %ymm11, %ymm0, %ymm0 + vmulpd L2+__svml_dlog10_data_internal(%rip), %ymm0, %ymm1 + +/* polynomial */ + vmovupd poly_coeff+__svml_dlog10_data_internal(%rip), %ymm0 + vfmadd213pd poly_coeff+32+__svml_dlog10_data_internal(%rip), %ymm2, %ymm0 + vmulpd poly_coeff+128+__svml_dlog10_data_internal(%rip), %ymm2, %ymm2 + vfmadd213pd %ymm12, %ymm13, %ymm0 + vfmadd213pd %ymm2, %ymm13, %ymm0 + vextractf128 $1, %ymm15, %xmm6 + vmovd %xmm15, %edx + vmovd %xmm6, %esi + movslq %edx, %rdx + vpextrd $2, %xmm15, %ecx + movslq %esi, %rsi + vpextrd $2, %xmm6, %edi + movslq %ecx, %rcx + movslq %edi, %rdi + vmovsd (%r8,%rdx), %xmm4 + vmovsd (%r8,%rsi), %xmm7 + vmovhpd (%r8,%rcx), %xmm4, %xmm5 + vmovhpd (%r8,%rdi), %xmm7, %xmm8 + vinsertf128 $1, %xmm8, %ymm5, %ymm14 + +/* reconstruction */ + vaddpd %ymm0, %ymm14, %ymm2 + vaddpd %ymm2, %ymm1, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm3, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; 
DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 
0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call log10@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_log10_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dlog10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Log_HA_table[(1<<9)+2][2]; + __declspec(align(32)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(32)) VUINT32 poly_coeff[5][4][2]; + __declspec(align(32)) VUINT32 ExpMask[4][2]; + __declspec(align(32)) VUINT32 Two10[4][2]; + __declspec(align(32)) VUINT32 MinNorm[4][2]; + __declspec(align(32)) VUINT32 MaxNorm[4][2]; + __declspec(align(32)) VUINT32 HalfMask[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 Bias[4][2]; + __declspec(align(32)) VUINT32 Bias1[4][2]; + __declspec(align(32)) VUINT32 L2[4][2]; +} __svml_dlog10_data_internal; +#endif +__svml_dlog10_data_internal: + /* Log_HA_table */ + .quad 0xc0733a7146f6b080, 0xbe1e707ce619c200 + .quad 0xc0733a7547771970, 0xbe1e79c6c06d6f51 + .quad 0xc0733a7945aacb70, 0xbe1e78e225fad29c + .quad 0xc0733a7d41946970, 0xbe1e76d607f9693b + .quad 0xc0733a813b3691f0, 0xbe1e7704b3e0685b + .quad 0xc0733a853293df00, 0xbe1e79c1216a27fa + .quad 0xc0733a8927aee660, 0xbe1e76dce5734a81 + .quad 0xc0733a8d1a8a3920, 0xbe1e782ee2ca4dba + .quad 0xc0733a910b286430, 0xbe1e7812d1a0a61f + .quad 0xc0733a94f98bf010, 0xbe1e77e1b5ecbc61 + .quad 0xc0733a98e5b76100, 0xbe1e76635cac1586 + .quad 0xc0733a9ccfad36f0, 0xbe1e7638f7968f32 + .quad 0xc0733aa0b76feda0, 0xbe1e7840ee76e365 + .quad 0xc0733aa49d01fcb0, 0xbe1e79f3fd01907e + .quad 0xc0733aa88065d7a0, 0xbe1e77bbb3a9c38a + .quad 0xc0733aac619dedb0, 
0xbe1e7742719bf41d + .quad 0xc0733ab040acaa20, 0xbe1e79bcedaf79cb + .quad 0xc0733ab41d947450, 0xbe1e762d63cb7ca0 + .quad 0xc0733ab7f857af50, 0xbe1e77a07be83403 + .quad 0xc0733abbd0f8ba80, 0xbe1e7763ff836ad0 + .quad 0xc0733abfa779f130, 0xbe1e7737720ead39 + .quad 0xc0733ac37bddaad0, 0xbe1e7776a08e55e7 + .quad 0xc0733ac74e263af0, 0xbe1e793e3c52dd36 + .quad 0xc0733acb1e55f160, 0xbe1e788a94695051 + .quad 0xc0733aceec6f1a10, 0xbe1e76508114a813 + .quad 0xc0733ad2b873fd20, 0xbe1e76909457d23e + .quad 0xc0733ad68266df10, 0xbe1e7664a24f9ca4 + .quad 0xc0733ada4a4a0090, 0xbe1e7a07b3d44b18 + .quad 0xc0733ade101f9ee0, 0xbe1e76d87594704d + .quad 0xc0733ae1d3e9f340, 0xbe1e79563595a182 + .quad 0xc0733ae595ab33b0, 0xbe1e771880c3c6ab + .quad 0xc0733ae955659250, 0xbe1e78c171f517d4 + .quad 0xc0733aed131b3df0, 0xbe1e77eac3874666 + .quad 0xc0733af0cece61b0, 0xbe1e790db479d8f6 + .quad 0xc0733af488812550, 0xbe1e7965d1aa5c90 + .quad 0xc0733af84035ad10, 0xbe1e78ceb398ba47 + .quad 0xc0733afbf5ee19c0, 0xbe1e779cc0dcb5aa + .quad 0xc0733affa9ac88c0, 0xbe1e7871053953ed + .quad 0xc0733b035b731420, 0xbe1e7a082cffa71a + .quad 0xc0733b070b43d2a0, 0xbe1e7904b4382fad + .quad 0xc0733b0ab920d790, 0xbe1e79b458d0b4f3 + .quad 0xc0733b0e650c3310, 0xbe1e79d0ded414c6 + .quad 0xc0733b120f07f200, 0xbe1e763c357a1943 + .quad 0xc0733b15b7161dd0, 0xbe1e78b80ba6daaa + .quad 0xc0733b195d38bd00, 0xbe1e7998e23b8ffd + .quad 0xc0733b1d0171d2c0, 0xbe1e7974aa65ee8c + .quad 0xc0733b20a3c35f20, 0xbe1e76ccfde752ab + .quad 0xc0733b24442f5ef0, 0xbe1e77b4ff19debb + .quad 0xc0733b27e2b7cc10, 0xbe1e7772ee478542 + .quad 0xc0733b2b7f5e9d30, 0xbe1e781d81b58b44 + .quad 0xc0733b2f1a25c600, 0xbe1e78350d967565 + .quad 0xc0733b32b30f3720, 0xbe1e783888e48152 + .quad 0xc0733b364a1cde30, 0xbe1e78367bf7c111 + .quad 0xc0733b39df50a5d0, 0xbe1e7959e57ca47d + .quad 0xc0733b3d72ac75c0, 0xbe1e777322423222 + .quad 0xc0733b41043232b0, 0xbe1e767ce42a60aa + .quad 0xc0733b4493e3be70, 0xbe1e781d445aea19 + .quad 0xc0733b4821c2f800, 0xbe1e7922fca18e18 + 
.quad 0xc0733b4badd1bb80, 0xbe1e76fed3d40647 + .quad 0xc0733b4f3811e210, 0xbe1e793948c9eabc + .quad 0xc0733b52c0854240, 0xbe1e76e487656b8c + .quad 0xc0733b56472daf90, 0xbe1e780ab2f71223 + .quad 0xc0733b59cc0cfaf0, 0xbe1e77189120b09c + .quad 0xc0733b5d4f24f270, 0xbe1e7644a0343a12 + .quad 0xc0733b60d0776160, 0xbe1e78f2a3e4733d + .quad 0xc0733b6450061080, 0xbe1e7913b2f73ae5 + .quad 0xc0733b67cdd2c5c0, 0xbe1e7882d08393b5 + .quad 0xc0733b6b49df4470, 0xbe1e765e1b209979 + .quad 0xc0733b6ec42d4d20, 0xbe1e785c9c4620d4 + .quad 0xc0733b75b394f240, 0xbe1e78878cd0e956 + .quad 0xc0733b7c9c178630, 0xbe1e789a4112d90b + .quad 0xc0733b837dc2b0f0, 0xbe1e79050b8a1766 + .quad 0xc0733b8a58a3f220, 0xbe1e7790dffc47aa + .quad 0xc0733b912cc8a180, 0xbe1e77174593b06a + .quad 0xc0733b97fa3defb0, 0xbe1e7677de2d2ecc + .quad 0xc0733b9ec110e6b0, 0xbe1e76cff477ca18 + .quad 0xc0733ba5814e6a80, 0xbe1e78f8644dec7b + .quad 0xc0733bac3b0339d0, 0xbe1e764e1361788d + .quad 0xc0733bb2ee3bee30, 0xbe1e78c913e738de + .quad 0xc0733bb99b04fd30, 0xbe1e76666f5bddaa + .quad 0xc0733bc0416ab850, 0xbe1e77e87cbd8ab6 + .quad 0xc0733bc6e1794e10, 0xbe1e76f18ba1c966 + .quad 0xc0733bcd7b3cca10, 0xbe1e777c9461b8db + .quad 0xc0733bd40ec115d0, 0xbe1e78b78526ffac + .quad 0xc0733bda9c11f920, 0xbe1e7942abecfede + .quad 0xc0733be1233b1aa0, 0xbe1e76d8a684fd8c + .quad 0xc0733be7a4480010, 0xbe1e79622b539ac9 + .quad 0xc0733bee1f440f30, 0xbe1e7978e7cc20ea + .quad 0xc0733bf4943a8de0, 0xbe1e765c9c9de825 + .quad 0xc0733bfb0336a290, 0xbe1e775d8b138ee2 + .quad 0xc0733c016c435500, 0xbe1e78bf33465c2f + .quad 0xc0733c07cf6b8e80, 0xbe1e78164f7cc441 + .quad 0xc0733c0e2cba1a50, 0xbe1e7824e64d0b23 + .quad 0xc0733c148439a630, 0xbe1e78373ae7dd81 + .quad 0xc0733c1ad5f4c2c0, 0xbe1e7704513e0afe + .quad 0xc0733c2121f5e3d0, 0xbe1e7914aa84200f + .quad 0xc0733c2768476110, 0xbe1e76b1cde25cf6 + .quad 0xc0733c2da8f37600, 0xbe1e796120e3862d + .quad 0xc0733c33e40442e0, 0xbe1e78ec836d7e7b + .quad 0xc0733c3a1983cca0, 0xbe1e77fb13b7dabb + .quad 0xc0733c40497bfd70, 
0xbe1e783c6fcb2404 + .quad 0xc0733c4673f6a530, 0xbe1e7628bb93dce8 + .quad 0xc0733c4c98fd7990, 0xbe1e7857a47b5001 + .quad 0xc0733c52b89a16d0, 0xbe1e76708dc2831f + .quad 0xc0733c58d2d5ffa0, 0xbe1e77b6038651f1 + .quad 0xc0733c5ee7ba9de0, 0xbe1e792e855bb5b2 + .quad 0xc0733c64f75142d0, 0xbe1e776cacd5c105 + .quad 0xc0733c6b01a32740, 0xbe1e77f8a8011315 + .quad 0xc0733c7106b96c30, 0xbe1e765cf3efcfde + .quad 0xc0733c77069d1ad0, 0xbe1e78d837d2efac + .quad 0xc0733c7d01572530, 0xbe1e78b615cf772c + .quad 0xc0733c82f6f06640, 0xbe1e7650bbbd7a25 + .quad 0xc0733c88e771a220, 0xbe1e78bcf3495872 + .quad 0xc0733c8ed2e386c0, 0xbe1e792266832e84 + .quad 0xc0733c94b94eabd0, 0xbe1e79c1c3c2ca52 + .quad 0xc0733c9a9abb9340, 0xbe1e78aa61e5807d + .quad 0xc0733ca07732a970, 0xbe1e7620fc4cf156 + .quad 0xc0733ca64ebc4570, 0xbe1e76b914a832c5 + .quad 0xc0733cac2160a970, 0xbe1e79227f72020e + .quad 0xc0733cb1ef280300, 0xbe1e77ac972cc008 + .quad 0xc0733cb7b81a6b10, 0xbe1e798089be41f4 + .quad 0xc0733cbd7c3fe6a0, 0xbe1e77942ae037fe + .quad 0xc0733cc33ba06690, 0xbe1e7956ae6463d9 + .quad 0xc0733cc8f643c850, 0xbe1e7918a50c7942 + .quad 0xc0733cceac31d5d0, 0xbe1e78308eeab604 + .quad 0xc0733cd45d7245e0, 0xbe1e76dd4ea88445 + .quad 0xc0733cda0a0cbc60, 0xbe1e77e7c1aa5909 + .quad 0xc0733cdfb208caa0, 0xbe1e7804b9d20e54 + .quad 0xc0733ce5556def70, 0xbe1e78f88e99d49c + .quad 0xc0733ceaf4439780, 0xbe1e787d74682d68 + .quad 0xc0733cf08e911d80, 0xbe1e76edc24fe6e7 + .quad 0xc0733cf6245dca50, 0xbe1e79b347ec86d2 + .quad 0xc0733cfbb5b0d580, 0xbe1e797cceb2c39b + .quad 0xc0733d0142916530, 0xbe1e783adbdc6aa1 + .quad 0xc0733d06cb068e70, 0xbe1e76e4c20e3d9e + .quad 0xc0733d0c4f175570, 0xbe1e77070bf3cf61 + .quad 0xc0733d11cecaadc0, 0xbe1e781c43502734 + .quad 0xc0733d174a277a80, 0xbe1e78b11268ea72 + .quad 0xc0733d1cc1348e90, 0xbe1e7754b83bfc7d + .quad 0xc0733d2233f8acb0, 0xbe1e7756c29bf5e9 + .quad 0xc0733d27a27a87d0, 0xbe1e7952fc1d9333 + .quad 0xc0733d2d0cc0c350, 0xbe1e778c76ae6077 + .quad 0xc0733d3272d1f2e0, 0xbe1e7a1896ba8f43 + 
.quad 0xc0733d37d4b49b30, 0xbe1e76dafdf432d8 + .quad 0xc0733d3d326f3180, 0xbe1e795330184013 + .quad 0xc0733d428c081c80, 0xbe1e763cc774d30f + .quad 0xc0733d47e185b3d0, 0xbe1e77030a779c0a + .quad 0xc0733d4d32ee40b0, 0xbe1e7908af2a2d7e + .quad 0xc0733d528047fe00, 0xbe1e78c4953b797d + .quad 0xc0733d57c9991850, 0xbe1e78b43b096579 + .quad 0xc0733d5d0ee7ae30, 0xbe1e7824ae0a4804 + .quad 0xc0733d625039d040, 0xbe1e79d2b2fbb740 + .quad 0xc0733d678d958190, 0xbe1e7662de59a1a6 + .quad 0xc0733d6cc700b760, 0xbe1e76b251d59aaa + .quad 0xc0733d71fc8159b0, 0xbe1e7a00cfd1f487 + .quad 0xc0733d772e1d4360, 0xbe1e77f4d246167e + .quad 0xc0733d7c5bda4200, 0xbe1e767a4ee8e6fc + .quad 0xc0733d8185be1640, 0xbe1e777ccf0a8aed + .quad 0xc0733d86abce7420, 0xbe1e767d7e279ada + .quad 0xc0733d8bce1102d0, 0xbe1e7a05cef4bb90 + .quad 0xc0733d90ec8b5d40, 0xbe1e78f75369be5b + .quad 0xc0733d96074311d0, 0xbe1e77b9612e8c8a + .quad 0xc0733d9b1e3da2b0, 0xbe1e794518b9adeb + .quad 0xc0733da031808620, 0xbe1e7810626fb934 + .quad 0xc0733da541112650, 0xbe1e76d87223fa6d + .quad 0xc0733daa4cf4e1a0, 0xbe1e794c5e7ca3b5 + .quad 0xc0733daf55310af0, 0xbe1e789856ef816f + .quad 0xc0733db459cae970, 0xbe1e77d2004effbd + .quad 0xc0733db95ac7b8f0, 0xbe1e78467d31eb9c + .quad 0xc0733dbe582caa00, 0xbe1e79aaa4e25787 + .quad 0xc0733dc351fee220, 0xbe1e762de8f107bf + .quad 0xc0733dc848437b90, 0xbe1e7670670a63fe + .quad 0xc0733dcd3aff85d0, 0xbe1e795ca237c6cc + .quad 0xc0733dd22a3805b0, 0xbe1e77e55c53c1d9 + .quad 0xc0733dd715f1f520, 0xbe1e78a806213ac4 + .quad 0xc0733ddbfe3243b0, 0xbe1e77743a2bc615 + .quad 0xc0733de0e2fdd660, 0xbe1e78b8b45b0b7d + .quad 0xc0733de5c4598800, 0xbe1e78d635f2f4b9 + .quad 0xc0733deaa24a2920, 0xbe1e7758c396a11e + .quad 0xc0733def7cd48020, 0xbe1e7a17a8cc454c + .quad 0xc0733df453fd49a0, 0xbe1e783caa73f616 + .quad 0xc0733df927c93820, 0xbe1e7932cfa29664 + .quad 0xc0733dfdf83cf490, 0xbe1e777d265c72a6 + .quad 0xc0733e02c55d1e10, 0xbe1e7775e7c03c60 + .quad 0xc0733e078f2e4a40, 0xbe1e79f65d52d232 + .quad 0xc0733e0c55b50570, 
0xbe1e76e7e7464b4e + .quad 0xc0733e1118f5d250, 0xbe1e77be81cad877 + .quad 0xc0733e15d8f52a80, 0xbe1e79dd25b5fb3a + .quad 0xc0733e1a95b77e80, 0xbe1e78e45f1418ef + .quad 0xc0733e1f4f4135a0, 0xbe1e78eb7289505b + .quad 0xc0733e240596ae50, 0xbe1e78a468c07cad + .quad 0xc0733e28b8bc3e20, 0xbe1e776b558a4009 + .quad 0xc0733e2d68b631d0, 0xbe1e77412eb9941e + .quad 0xc0733e321588cd80, 0xbe1e76b2853f845e + .quad 0xc0733e36bf384cb0, 0xbe1e76aa7184273c + .quad 0xc0733e3b65c8e260, 0xbe1e7832027f78fa + .quad 0xc0733e40093eb930, 0xbe1e7a1c7da131f5 + .quad 0xc0733e44a99df380, 0xbe1e76a0bc2ae4bc + .quad 0xc0733e4946eaab30, 0xbe1e78dff13b6f5d + .quad 0xc0733e4de128f250, 0xbe1e765a226dea2c + .quad 0xc0733e52785cd290, 0xbe1e78509b989111 + .quad 0xc0733e570c8a4de0, 0xbe1e7916a4e9803d + .quad 0xc0733e5b9db55e30, 0xbe1e7950c15758cc + .quad 0xc0733e602be1f5a0, 0xbe1e7922ba1ad420 + .quad 0xc0733e64b713fe90, 0xbe1e794cbaabcef6 + .quad 0xc0733e693f4f5bc0, 0xbe1e7837bf883fed + .quad 0xc0733e6dc497e850, 0xbe1e76f198ddbbdf + .quad 0xc0733e7246f177d0, 0xbe1e7a18c1067764 + .quad 0xc0733e76c65fd6a0, 0xbe1e76b845a8fd9d + .quad 0xc0733e7b42e6c970, 0xbe1e7714012df506 + .quad 0xc0733e7fbc8a0de0, 0xbe1e7765612922cd + .quad 0xc0733e84334d5a50, 0xbe1e7688f5424a00 + .quad 0xc0733e88a7345df0, 0xbe1e769d011f6663 + .quad 0xc0733e8d1842c0e0, 0xbe1e79914acbfaf7 + .quad 0xc0733e91867c2460, 0xbe1e79a85e189bd7 + .quad 0xc0733e95f1e422a0, 0xbe1e79ea7c726432 + .quad 0xc0733e9a5a7e4f10, 0xbe1e768a6fbb8e6e + .quad 0xc0733e9ec04e3620, 0xbe1e793c75bcc9fc + .quad 0xc0733ea323575dd0, 0xbe1e797f78da13d4 + .quad 0xc0733ea7839d4550, 0xbe1e78d8c9cda978 + .quad 0xc0733eabe1236540, 0xbe1e77028d480fff + .quad 0xc0733eb03bed2fa0, 0xbe1e7a0d0f74ff7c + .quad 0xc0733eb493fe1040, 0xbe1e76732e8a35fb + .quad 0xc0733eb8e9596c30, 0xbe1e77220caeabeb + .quad 0xc0733ebd3c02a260, 0xbe1e797438b645ef + .quad 0xc0733ec18bfd0b80, 0xbe1e79207c5fd6e8 + .quad 0xc0733ec5d94bf9f0, 0xbe1e781c7df8f946 + .quad 0xc0733eca23f2b9f0, 0xbe1e76736284e2db + 
.quad 0xc0733ece6bf49190, 0xbe1e7a109cc0c3f5 + .quad 0xc0733ed2b154c120, 0xbe1e767f14a16d50 + .quad 0xc0733ed6f4168290, 0xbe1e789cd22acaf0 + .quad 0xc0733edb343d0a40, 0xbe1e764355ca28ad + .quad 0xc0733edf71cb8660, 0xbe1e79e4c7a81c45 + .quad 0xc0733ee3acc51fb0, 0xbe1e761e26b644c2 + .quad 0xc0733ee7e52cf8c0, 0xbe1e793e9f8fbdd3 + .quad 0xc0733eec1b062ed0, 0xbe1e78c432991c20 + .quad 0xc0733ef04e53d940, 0xbe1e78cdd025f4d8 + .quad 0xc0733ef47f1909f0, 0xbe1e778310c6446e + .quad 0xc0733ef8ad58cd20, 0xbe1e7871af3d6e17 + .quad 0xc0733efcd91629b0, 0xbe1e77e0e906f697 + .quad 0xc0733f01025420f0, 0xbe1e7a1ae9b27892 + .quad 0xc0733f052915af00, 0xbe1e76ac64c88f9d + .quad 0xc0733f094d5dca60, 0xbe1e779a815589c4 + .quad 0xc0733f0d6f2f6480, 0xbe1e788f39a4864c + .quad 0xc0733f118e8d6980, 0xbe1e79fc51263525 + .quad 0xc0733f15ab7ac060, 0xbe1e783501f19e90 + .quad 0xc0733f19c5fa4ae0, 0xbe1e767e82c327ab + .quad 0xc0733f1dde0ee5a0, 0xbe1e7a1785d66123 + .quad 0xc0733f21f3bb6870, 0xbe1e7936d07203da + .quad 0xc0733f260702a5e0, 0xbe1e7a010a7ac699 + .quad 0xc0733f2a17e76bb0, 0xbe1e7975e4e16312 + .quad 0xc0733f2e266c82b0, 0xbe1e7654b5422330 + .quad 0xc0733f323294aeb0, 0xbe1e77f8a4909d35 + .quad 0xc0733f363c62aee0, 0xbe1e792c8e30d226 + .quad 0xc0733f3a43d93da0, 0xbe1e76f6ac67a1ff + .quad 0xc0733f3e48fb1070, 0xbe1e775c2e97715a + .quad 0xc0733f424bcad840, 0xbe1e781cd54ae100 + /*== Log_LA_table ==*/ + .align 32 + .quad 0x0000000000000000 + .quad 0xbf4bc48a867884b7 + .quad 0xbf5bbd9e9482af09 + .quad 0xbf64c9096b94befd + .quad 0xbf6bafd47221ed26 + .quad 0xbf714999e2ad8ea6 + .quad 0xbf74b99563d2a1bd + .quad 0xbf7827de6b310350 + .quad 0xbf7b9476a4fcd10f + .quad 0xbf7eff5fbaf25781 + .quad 0xbf81344daa2d7553 + .quad 0xbf82e8158b08d957 + .quad 0xbf849b0851443684 + .quad 0xbf864d26cce610dd + .quad 0xbf87fe71ccc4e6b0 + .quad 0xbf89aeea1e897fdf + .quad 0xbf8b5e908eb13790 + .quad 0xbf8d0d65e890405a + .quad 0xbf8ebb6af653e2ee + .quad 0xbf90345040825bad + .quad 0xbf910a83a8446c78 + .quad 0xbf91e05015d30a71 + .quad 
0xbf92b5b5ec0209d3 + .quad 0xbf938ab58d173e91 + .quad 0xbf945f4f5acb8be0 + .quad 0xbf953383b64bf13f + .quad 0xbf960753003a94ef + .quad 0xbf96dabd98afcc05 + .quad 0xbf97adc3df3b1ff8 + .quad 0xbf98806632e451d0 + .quad 0xbf9952a4f22c5ae9 + .quad 0xbf9a24807b0e6b5c + .quad 0xbf9af5f92b00e610 + .quad 0xbf9bc70f5ef65a77 + .quad 0xbf9c97c3735e7c0a + .quad 0xbf9d6815c4271775 + .quad 0xbf9e3806acbd058f + .quad 0xbf9f0796880d1c19 + .quad 0xbf9fd6c5b0851c4c + .quad 0xbfa052ca400a4f9b + .quad 0xbfa0ba01a8170000 + .quad 0xbfa121093ce3a205 + .quad 0xbfa187e12aad8077 + .quad 0xbfa1ee899d74a03e + .quad 0xbfa25502c0fc314c + .quad 0xbfa2bb4cc0cafe8d + .quad 0xbfa32167c82bdcda + .quad 0xbfa38754022e18e2 + .quad 0xbfa3ed1199a5e425 + .quad 0xbfa452a0b92cc0ec + .quad 0xbfa4b8018b21ed4f + .quad 0xbfa51d3439aacd4a + .quad 0xbfa58238eeb353da + .quad 0xbfa5e70fd3ee6b34 + .quad 0xbfa64bb912d65c07 + .quad 0xbfa6b034d4ad33df + .quad 0xbfa71483427d2a99 + .quad 0xbfa778a4851906f3 + .quad 0xbfa7dc98c51c8242 + .quad 0xbfa840602aecab3d + .quad 0xbfa8a3fadeb847f4 + .quad 0xbfa90769087836e4 + .quad 0xbfa96aaacfefcf3c + .quad 0xbfa9cdc05cad4042 + .quad 0xbfaa30a9d609efea + .quad 0xbfaa9367632ad897 + .quad 0xbfaaf5f92b00e610 + .quad 0xbfab585f544951a4 + .quad 0xbfabba9a058dfd84 + .quad 0xbfac1ca96525cf56 + .quad 0xbfac7e8d993509f9 + .quad 0xbface046c7ada68d + .quad 0xbfad41d5164facb4 + .quad 0xbfada338aaa98a0c + .quad 0xbfae0471aa1868f5 + .quad 0xbfae658039c88690 + .quad 0xbfaec6647eb58808 + .quad 0xbfaf271e9daacf20 + .quad 0xbfaf87aebb43ce06 + .quad 0xbfafe814fbec5a77 + .quad 0xbfb02428c1f08016 + .quad 0xbfb054323b97a948 + .quad 0xbfb08426fcdb1ee7 + .quad 0xbfb0b40717932b96 + .quad 0xbfb0e3d29d81165e + .quad 0xbfb11389a04f4a2e + .quad 0xbfb1432c31917d08 + .quad 0xbfb172ba62c4d6de + .quad 0xbfb1a23445501816 + .quad 0xbfb1d199ea83bfbe + .quad 0xbfb200eb639a3173 + .quad 0xbfb23028c1b7daed + .quad 0xbfb25f5215eb594a + .quad 0xbfb28e67712d9dfc + .quad 0xbfb2bd68e4621371 + .quad 0xbfb2ec568056c16f + .quad 
0xbfb31b3055c47118 + .quad 0xbfb349f6754ed0b4 + .quad 0xbfb378a8ef84971e + .quad 0xbfb3a747d4dfa6f5 + .quad 0xbfb3d5d335c53179 + .quad 0xbfb4044b2285d925 + .quad 0xbfb432afab5dd3ff + .quad 0xbfb46100e0750da1 + .quad 0xbfb48f3ed1df48fb + .quad 0xbfb4bd698f9c41cf + .quad 0xbfb4eb812997cde4 + .quad 0xbfb51985afa9fdfd + .quad 0xbfb5477731973e85 + .quad 0xbfb57555bf1077f5 + .quad 0xbfb5a32167b32f02 + .quad 0xbfb5d0da3b09a47e + .quad 0xbfb5fe80488af4fd + .quad 0xbfb62c139f9b3837 + .quad 0xbfb659944f8ba02d + .quad 0xbfb68702679a980a + .quad 0xbfb6b45df6f3e2c9 + .quad 0xbfb6e1a70cb0b99a + .quad 0xbfb70eddb7d7ea07 + .quad 0xbfb73c02075df3e5 + .quad 0xbfb769140a2526fd + .quad 0xbfb79613cefdc07d + .quad 0xbfb7c30164a60836 + .quad 0xbfb7efdcd9ca6d8f + .quad 0xbfb81ca63d05a44a + .quad 0xbfb8495d9ce0c10c + .quad 0xbfb8760307d355ab + .quad 0xbfb8a2968c438d41 + .quad 0xbfb8cf183886480d + .quad 0xbfb8fb881adf3713 + .quad 0xbfb927e64180f790 + .quad 0xbfb95432ba8d2e2f + .quad 0xbfb9806d9414a209 + .quad 0xbfb9ac96dc175776 + .quad 0xbfb9d8aea084aa9c + .quad 0xbfba04b4ef3b69d8 + .quad 0xbfba30a9d609efea + .quad 0xbfba5c8d62ae3dec + .quad 0xbfba885fa2d6151e + .quad 0xbfbab420a41f1076 + .quad 0xbfbadfd07416be07 + .quad 0xbfbb0b6f203ab82c + .quad 0xbfbb36fcb5f8be8a + .quad 0xbfbb627942aecedd + .quad 0xbfbb8de4d3ab3d98 + .quad 0xbfbbb93f762cce4f + .quad 0xbfbbe4893762cbf7 + .quad 0xbfbc0fc2246d20f5 + .quad 0xbfbc3aea4a5c6eff + .quad 0xbfbc6601b63226cb + .quad 0xbfbc910874e09f98 + .quad 0xbfbcbbfe934b2e81 + .quad 0xbfbce6e41e463da5 + .quad 0xbfbd11b92297632b + .quad 0xbfbd3c7dacf5780b + .quad 0xbfbd6731ca08aeb9 + .quad 0xbfbd91d5866aa99c + .quad 0xbfbdbc68eea6915b + .quad 0xbfbde6ec0f392b05 + .quad 0xbfbe115ef490ee07 + .quad 0xbfbe3bc1ab0e19fe + .quad 0xbfbe66143f02cc5d + .quad 0xbfbe9056bcb315e8 + .quad 0xbfbeba893055100b + .quad 0xbfbee4aba610f204 + .quad 0xbfbf0ebe2a0125eb + .quad 0xbfbf38c0c8325d86 + .quad 0xbfbf62b38ca3a706 + .quad 0xbfbf8c9683468191 + .quad 0xbfbfb669b7fef1a8 + .quad 
0xbfbfe02d36a3956d + .quad 0xbfc004f0857edc5c + .quad 0xbfc019c2a064b486 + .quad 0xbfc02e8cf1dac4b8 + .quad 0xbfc0434f7fb1f307 + .quad 0xbfc0580a4fb4a3df + .quad 0xbfc06cbd67a6c3b6 + .quad 0xbfc08168cd45d0a9 + .quad 0xbfc0960c8648e406 + .quad 0xbfc0aaa89860bbcf + .quad 0xbfc0bf3d0937c41c + .quad 0xbfc0d3c9de722078 + .quad 0xbfc0e84f1dadb526 + .quad 0xbfc0fccccc823059 + .quad 0xbfc11142f0811357 + .quad 0xbfc125b18f35bb8e + .quad 0xbfc13a18ae256b99 + .quad 0xbfc14e7852cf5430 + .quad 0xbfc162d082ac9d10 + .quad 0xbfc1772143306dc6 + .quad 0xbfc18b6a99c7f679 + .quad 0xbfc19fac8bda7897 + .quad 0xbfc1b3e71ec94f7b + .quad 0xbfc1c81a57eff8fd + .quad 0xbfc1dc463ca41df8 + .quad 0xbfc1f06ad2359abd + .quad 0xbfc204881dee8777 + .quad 0xbfc2189e25134081 + .quad 0xbfc22cacece26ead + .quad 0xbfc240b47a950f79 + .quad 0xbfc254b4d35e7d3c + .quad 0xbfc268adfc6c773e + .quad 0xbfc27c9ffae729c1 + .quad 0xbfc2908ad3f13603 + .quad 0xbfc2a46e8ca7ba2a + .quad 0xbfc2b84b2a225923 + .quad 0xbfc2cc20b1734279 + .quad 0xbfc2dfef27a73a18 + .quad 0xbfc2f3b691c5a001 + .quad 0xbfc30776f4d077f7 + .quad 0xbfc31b3055c47118 + .quad 0xbfc32ee2b998ed6e + .quad 0xbfc3428e2540096d + .quad 0x3fc331f403985097 + .quad 0x3fc31e56798a910a + .quad 0x3fc30abfd8f333b6 + .quad 0x3fc2f7301cf4e87b + .quad 0x3fc2e3a740b7800f + .quad 0x3fc2d0253f67e4cb + .quad 0x3fc2bcaa14381386 + .quad 0x3fc2a935ba5f1479 + .quad 0x3fc295c82d18f434 + .quad 0x3fc2826167a6bc9c + .quad 0x3fc26f01654e6df6 + .quad 0x3fc25ba8215af7fc + .quad 0x3fc24855971c3307 + .quad 0x3fc23509c1e6d937 + .quad 0x3fc221c49d147fb3 + .quad 0x3fc20e8624038fed + .quad 0x3fc1fb4e521740f4 + .quad 0x3fc1e81d22b790d4 + .quad 0x3fc1d4f291513e01 + .quad 0x3fc1c1ce9955c0c6 + .quad 0x3fc1aeb1363b44c8 + .quad 0x3fc19b9a637ca295 + .quad 0x3fc1888a1c995931 + .quad 0x3fc175805d1587c1 + .quad 0x3fc1627d2079e731 + .quad 0x3fc14f806253c3ed + .quad 0x3fc13c8a1e34f7a0 + .quad 0x3fc1299a4fb3e306 + .quad 0x3fc116b0f26b67bb + .quad 0x3fc103ce01fae223 + .quad 0x3fc0f0f17a062353 + .quad 
0x3fc0de1b56356b04 + .quad 0x3fc0cb4b9235619a + .quad 0x3fc0b88229b71227 + .quad 0x3fc0a5bf186fe483 + .quad 0x3fc093025a19976c + .quad 0x3fc0804bea723aa9 + .quad 0x3fc06d9bc53c2941 + .quad 0x3fc05af1e63e03b4 + .quad 0x3fc0484e4942aa43 + .quad 0x3fc035b0ea19373b + .quad 0x3fc02319c494f951 + .quad 0x3fc01088d48d6e03 + .quad 0x3fbffbfc2bbc7803 + .quad 0x3fbfd6f308ce5b52 + .quad 0x3fbfb1f6381856f4 + .quad 0x3fbf8d05b16a6d47 + .quad 0x3fbf68216c9cc727 + .quad 0x3fbf4349618fa91a + .quad 0x3fbf1e7d882b689a + .quad 0x3fbef9bdd860616b + .quad 0x3fbed50a4a26eafc + .quad 0x3fbeb062d57f4de8 + .quad 0x3fbe8bc77271b97a + .quad 0x3fbe6738190e394c + .quad 0x3fbe42b4c16caaf3 + .quad 0x3fbe1e3d63acb3ba + .quad 0x3fbdf9d1f7f5b674 + .quad 0x3fbdd5727676c959 + .quad 0x3fbdb11ed766abf4 + .quad 0x3fbd8cd71303bd26 + .quad 0x3fbd689b2193f133 + .quad 0x3fbd446afb64c7e5 + .quad 0x3fbd204698cb42bd + .quad 0x3fbcfc2df223db2d + .quad 0x3fbcd820ffd278f3 + .quad 0x3fbcb41fba42686d + .quad 0x3fbc902a19e65111 + .quad 0x3fbc6c4017382bea + .quad 0x3fbc4861aab93a23 + .quad 0x3fbc248eccf1fba6 + .quad 0x3fbc00c7767225cb + .quad 0x3fbbdd0b9fd09a10 + .quad 0x3fbbb95b41ab5ce6 + .quad 0x3fbb95b654a78c87 + .quad 0x3fbb721cd17157e3 + .quad 0x3fbb4e8eb0bbf58f + .quad 0x3fbb2b0beb419ad0 + .quad 0x3fbb079479c372ad + .quad 0x3fbae4285509950b + .quad 0x3fbac0c775e2fde6 + .quad 0x3fba9d71d5258484 + .quad 0x3fba7a276badd2c8 + .quad 0x3fba56e8325f5c87 + .quad 0x3fba33b4222456f1 + .quad 0x3fba108b33edb005 + .quad 0x3fb9ed6d60b30612 + .quad 0x3fb9ca5aa1729f45 + .quad 0x3fb9a752ef316149 + .quad 0x3fb9845642fac8f0 + .quad 0x3fb9616495e0e1e8 + .quad 0x3fb93e7de0fc3e80 + .quad 0x3fb91ba21d6bef77 + .quad 0x3fb8f8d144557bdf + .quad 0x3fb8d60b4ee4d901 + .quad 0x3fb8b350364c6257 + .quad 0x3fb8909ff3c4d191 + .quad 0x3fb86dfa808d36a0 + .quad 0x3fb84b5fd5eaefd8 + .quad 0x3fb828cfed29a215 + .quad 0x3fb8064abf9b30f1 + .quad 0x3fb7e3d04697b704 + .quad 0x3fb7c1607b7d7e32 + .quad 0x3fb79efb57b0f803 + .quad 0x3fb77ca0d49cb608 + .quad 
0x3fb75a50ebb1624a + .quad 0x3fb7380b9665b7c8 + .quad 0x3fb715d0ce367afc + .quad 0x3fb6f3a08ca67270 + .quad 0x3fb6d17acb3e5f5e + .quad 0x3fb6af5f838cf654 + .quad 0x3fb68d4eaf26d7ee + .quad 0x3fb66b4847a68997 + .quad 0x3fb6494c46ac6e4d + .quad 0x3fb6275aa5debf81 + .quad 0x3fb605735ee985f1 + .quad 0x3fb5e3966b7e9295 + .quad 0x3fb5c1c3c5557799 + .quad 0x3fb59ffb662b815c + .quad 0x3fb57e3d47c3af7b + .quad 0x3fb55c8963e6adeb + .quad 0x3fb53adfb462ce16 + .quad 0x3fb51940330c000b + .quad 0x3fb4f7aad9bbcbaf + .quad 0x3fb4d61fa2514a00 + .quad 0x3fb4b49e86b11e5f + .quad 0x3fb4932780c56fe2 + .quad 0x3fb471ba8a7de2b7 + .quad 0x3fb450579dcf9186 + .quad 0x3fb42efeb4b506e9 + .quad 0x3fb40dafc92e36e2 + .quad 0x3fb3ec6ad5407868 + .quad 0x3fb3cb2fd2f67ef1 + .quad 0x3fb3a9febc60540a + .quad 0x3fb388d78b9350ff + .quad 0x3fb367ba3aaa1883 + .quad 0x3fb346a6c3c49066 + .quad 0x3fb3259d2107db54 + .quad 0x3fb3049d4c9e52a0 + .quad 0x3fb2e3a740b7800f + .quad 0x3fb2c2baf78817b7 + .quad 0x3fb2a1d86b49f1e2 + .quad 0x3fb280ff963c04fc + .quad 0x3fb2603072a25f82 + .quad 0x3fb23f6afac6220a + .quad 0x3fb21eaf28f57941 + .quad 0x3fb1fdfcf7839804 + .quad 0x3fb1dd5460c8b16f + .quad 0x3fb1bcb55f21f307 + .quad 0x3fb19c1fecf17ee0 + .quad 0x3fb17b94049e65d0 + .quad 0x3fb15b11a094a1aa + .quad 0x3fb13a98bb450f81 + .quad 0x3fb11a294f2569f6 + .quad 0x3fb0f9c356b04389 + .quad 0x3fb0d966cc6500fa + .quad 0x3fb0b913aac7d3a7 + .quad 0x3fb098c9ec61b3ff + .quad 0x3fb078898bc05bf4 + .quad 0x3fb0585283764178 + .quad 0x3fb03824ce1a9101 + .quad 0x3fb0180066492817 + .quad 0x3fafefca8d451fd6 + .quad 0x3fafafa6d397efdb + .quad 0x3faf6f9594de60f0 + .quad 0x3faf2f96c6754aee + .quad 0x3faeefaa5dc2b239 + .quad 0x3faeafd05035bd3b + .quad 0x3fae70089346a9e6 + .quad 0x3fae30531c76c34a + .quad 0x3fadf0afe1505738 + .quad 0x3fadb11ed766abf4 + .quad 0x3fad719ff455f5f7 + .quad 0x3fad32332dc34dbd + .quad 0x3facf2d8795ca5a5 + .quad 0x3facb38fccd8bfdb + .quad 0x3fac74591df72456 + .quad 0x3fac3534628016dd + .quad 0x3fabf62190448d22 + .quad 
0x3fabb7209d1e24e5 + .quad 0x3fab78317eef1a29 + .quad 0x3fab39542ba23d73 + .quad 0x3faafa88992aea19 + .quad 0x3faabbcebd84fca0 + .quad 0x3faa7d268eb4c924 + .quad 0x3faa3e9002c711d2 + .quad 0x3faa000b0fd0fd6b + .quad 0x3fa9c197abf00dd7 + .quad 0x3fa98335cd4a16c3 + .quad 0x3fa944e56a0d3450 + .quad 0x3fa906a6786fc1cb + .quad 0x3fa8c878eeb05074 + .quad 0x3fa88a5cc3159e53 + .quad 0x3fa84c51ebee8d15 + .quad 0x3fa80e585f9218fc + .quad 0x3fa7d070145f4fd7 + .quad 0x3fa7929900bd4809 + .quad 0x3fa754d31b1b179c + .quad 0x3fa7171e59efcb5f + .quad 0x3fa6d97ab3ba5e10 + .quad 0x3fa69be81f01af99 + .quad 0x3fa65e6692547c4e + .quad 0x3fa620f604495440 + .quad 0x3fa5e3966b7e9295 + .quad 0x3fa5a647be9a54f6 + .quad 0x3fa56909f44a72fe + .quad 0x3fa52bdd034475b8 + .quad 0x3fa4eec0e2458f30 + .quad 0x3fa4b1b588129203 + .quad 0x3fa474baeb77e904 + .quad 0x3fa437d103498eec + .quad 0x3fa3faf7c663060e + .quad 0x3fa3be2f2ba7501f + .quad 0x3fa381772a00e604 + .quad 0x3fa344cfb861afae + .quad 0x3fa30838cdc2fbfd + .quad 0x3fa2cbb2612578b4 + .quad 0x3fa28f3c69912a74 + .quad 0x3fa252d6de1564c1 + .quad 0x3fa21681b5c8c213 + .quad 0x3fa1da3ce7c91bf8 + .quad 0x3fa19e086b3b8333 + .quad 0x3fa161e4374c37f4 + .quad 0x3fa125d0432ea20e + .quad 0x3fa0e9cc861d4944 + .quad 0x3fa0add8f759cd95 + .quad 0x3fa071f58e2cdf9b + .quad 0x3fa0362241e638ec + .quad 0x3f9ff4be13b92920 + .quad 0x3f9f7d57badb4ee8 + .quad 0x3f9f061167fc31e8 + .quad 0x3f9e8eeb09f2f6cb + .quad 0x3f9e17e48fa48962 + .quad 0x3f9da0fde8038de9 + .quad 0x3f9d2a3702105259 + .quad 0x3f9cb38fccd8bfdb + .quad 0x3f9c3d0837784c41 + .quad 0x3f9bc6a03117eb97 + .quad 0x3f9b5057a8ee01ce + .quad 0x3f9ada2e8e3e546f + .quad 0x3f9a6424d059fc68 + .quad 0x3f99ee3a5e9f57e8 + .quad 0x3f99786f2879fc53 + .quad 0x3f9902c31d62a843 + .quad 0x3f988d362cdf359e + .quad 0x3f9817c846828bbd + .quad 0x3f97a27959ec91aa + .quad 0x3f972d4956ca2067 + .quad 0x3f96b8382cd4f551 + .quad 0x3f964345cbd3a491 + .quad 0x3f95ce7223998b98 + .quad 0x3f9559bd2406c3ba + .quad 0x3f94e526bd0814d1 + .quad 
0x3f9470aede96e7f2 + .quad 0x3f93fc5578b93a38 + .quad 0x3f93881a7b818f9e + .quad 0x3f9313fdd70ee5e8 + .quad 0x3f929fff7b8ca79d + .quad 0x3f922c1f59329f1b + .quad 0x3f91b85d6044e9ae + .quad 0x3f9144b98113eac0 + .quad 0x3f90d133abfc3f1b + .quad 0x3f905dcbd166b033 + .quad 0x3f8fd503c3904f1d + .quad 0x3f8eeeab9b43445d + .quad 0x3f8e088f0b004827 + .quad 0x3f8d22adf3f9579d + .quad 0x3f8c3d0837784c41 + .quad 0x3f8b579db6dec358 + .quad 0x3f8a726e53a6056e + .quad 0x3f898d79ef5eedf0 + .quad 0x3f88a8c06bb1d2f4 + .quad 0x3f87c441aa5e6d15 + .quad 0x3f86dffd8d3bbf70 + .quad 0x3f85fbf3f637ffc5 + .quad 0x3f851824c7587eb0 + .quad 0x3f84348fe2b99002 + .quad 0x3f8351352a8e733f + .quad 0x3f826e1481213c2e + .quad 0x3f818b2dc8d2bb91 + .quad 0x3f80a880e41a67f6 + .quad 0x3f7f8c1b6b0c8d4e + .quad 0x3f7dc7a83f75a96d + .quad 0x3f7c03a80ae5e054 + .quad 0x3f7a401a92ff827e + .quad 0x3f787cff9d9147a5 + .quad 0x3f76ba56f09621bc + .quad 0x3f74f8205235102d + .quad 0x3f73365b88c0f347 + .quad 0x3f7175085ab85ff0 + .quad 0x3f6f684d1d8ae702 + .quad 0x3f6be76bd77b4fc3 + .quad 0x3f68676c71434fb9 + .quad 0x3f64e84e793a474a + .quad 0x3f616a117e0d4b30 + .quad 0x3f5bd96a1d7d9cbc + .quad 0x3f54e071754c98ba + .quad 0x3f4bd27045bfd025 + .quad 0x3f3bcef518e29612 + .quad 0x8000000000000000 + /*== poly_coeff[5] ==*/ + .align 32 + .quad 0x3fb63C65231FBD16, 0x3fb63C65231FBD16, 0x3fb63C65231FBD16, 0x3fb63C65231FBD16 /* coeff5 */ + .quad 0xbfbBCB7D4EFBE80B, 0xbfbBCB7D4EFBE80B, 0xbfbBCB7D4EFBE80B, 0xbfbBCB7D4EFBE80B /* coeff4 */ + .quad 0x3fc287A7636F341E, 0x3fc287A7636F341E, 0x3fc287A7636F341E, 0x3fc287A7636F341E /* coeff3 */ + .quad 0xbfcBCB7B1526DE36, 0xbfcBCB7B1526DE36, 0xbfcBCB7B1526DE36, 0xbfcBCB7B1526DE36 /* coeff2 */ + .quad 0x3fdBCB7B1526E50E, 0x3fdBCB7B1526E50E, 0x3fdBCB7B1526E50E, 0x3fdBCB7B1526E50E /* coeff1 */ + /*== ExpMask ==*/ + .align 32 + .quad 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 32 + .quad 0x3f50000000000000, 0x3f50000000000000, 
0x3f50000000000000, 0x3f50000000000000 + /*== MinNorm ==*/ + .align 32 + .quad 0x0010000000000000, 0x0010000000000000, 0x0010000000000000, 0x0010000000000000 + /*== MaxNorm ==*/ + .align 32 + .quad 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff + /*== HalfMask ==*/ + .align 32 + .quad 0xfffffffffc000000, 0xfffffffffc000000, 0xfffffffffc000000, 0xfffffffffc000000 + /*== One ==*/ + .align 32 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== Threshold ==*/ + .align 32 + .quad 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 32 + .quad 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 32 + .quad 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000 + /*== L2 ==*/ + .align 32 + .quad 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff + .align 32 + .type __svml_dlog10_data_internal,@object + .size __svml_dlog10_data_internal,.-__svml_dlog10_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core-avx2.S new file mode 100644 index 0000000000..3432e7cffe --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized log10, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN8v_log10 _ZGVeN8v_log10_avx2_wrapper +#include "../svml_d_log108_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core.c new file mode 100644 index 0000000000..273a0d4739 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log10, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN8v_log10 +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_log10, __GI__ZGVeN8v_log10, __redirect__ZGVeN8v_log10) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core_avx512.S new file mode 100644 index 0000000000..0799f99eba --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log108_core_avx512.S @@ -0,0 +1,299 @@ +/* Function log10 vectorized with AVX-512. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log10(x) = k*log10(2.0) - log10(Rcp) + poly_approximation(R) + * log10(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog10_data_internal_avx512 + */ +#define Log_tbl 0 +#define One 128 +#define C075 192 +#define poly_coeff9 256 +#define poly_coeff8 320 +#define poly_coeff7 384 +#define poly_coeff6 448 +#define poly_coeff5 512 +#define poly_coeff4 576 +#define poly_coeff3 640 +#define poly_coeff2 704 +#define poly_coeff1 768 +#define L2 832 + +#include + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_log10_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovaps %zmm0, %zmm7 + vgetmantpd $8, {sae}, %zmm7, %zmm6 + vmovups One+__svml_dlog10_data_internal_avx512(%rip), %zmm3 + vmovups poly_coeff5+__svml_dlog10_data_internal_avx512(%rip), %zmm12 + vmovups poly_coeff3+__svml_dlog10_data_internal_avx512(%rip), %zmm13 + +/* Start polynomial evaluation */ + vmovups poly_coeff9+__svml_dlog10_data_internal_avx512(%rip), %zmm10 + vmovups 
poly_coeff8+__svml_dlog10_data_internal_avx512(%rip), %zmm1 + vmovups poly_coeff7+__svml_dlog10_data_internal_avx512(%rip), %zmm11 + vmovups poly_coeff6+__svml_dlog10_data_internal_avx512(%rip), %zmm14 + +/* Prepare exponent correction: DblRcp<0.75? */ + vmovups C075+__svml_dlog10_data_internal_avx512(%rip), %zmm2 + +/* Table lookup */ + vmovups __svml_dlog10_data_internal_avx512(%rip), %zmm5 + +/* GetExp(x) */ + vgetexppd {sae}, %zmm7, %zmm0 + +/* DblRcp ~ 1/Mantissa */ + vrcp14pd %zmm6, %zmm8 + +/* x<=0? */ + vfpclasspd $94, %zmm7, %k0 + +/* round DblRcp to 4 fractional bits (RN mode, no Precision exception) */ + vrndscalepd $88, {sae}, %zmm8, %zmm4 + vmovups poly_coeff4+__svml_dlog10_data_internal_avx512(%rip), %zmm8 + kmovw %k0, %edx + +/* Reduced argument: R = DblRcp*Mantissa - 1 */ + vfmsub213pd {rn-sae}, %zmm3, %zmm4, %zmm6 + vcmppd $17, {sae}, %zmm2, %zmm4, %k1 + vfmadd231pd {rn-sae}, %zmm6, %zmm12, %zmm8 + vmovups poly_coeff2+__svml_dlog10_data_internal_avx512(%rip), %zmm12 + vfmadd231pd {rn-sae}, %zmm6, %zmm10, %zmm1 + vfmadd231pd {rn-sae}, %zmm6, %zmm11, %zmm14 + vmovups poly_coeff1+__svml_dlog10_data_internal_avx512(%rip), %zmm2 + +/* R^2 */ + vmulpd {rn-sae}, %zmm6, %zmm6, %zmm15 + vfmadd231pd {rn-sae}, %zmm6, %zmm13, %zmm12 + +/* Prepare table index */ + vpsrlq $48, %zmm4, %zmm9 + +/* add 1 to Expon if DblRcp<0.75 */ + vaddpd {rn-sae}, %zmm3, %zmm0, %zmm0{%k1} + vmulpd {rn-sae}, %zmm15, %zmm15, %zmm13 + vfmadd213pd {rn-sae}, %zmm14, %zmm15, %zmm1 + vfmadd213pd {rn-sae}, %zmm12, %zmm15, %zmm8 + vpermt2pd Log_tbl+64+__svml_dlog10_data_internal_avx512(%rip), %zmm9, %zmm5 + +/* polynomial */ + vfmadd213pd {rn-sae}, %zmm8, %zmm13, %zmm1 + vfmadd213pd {rn-sae}, %zmm2, %zmm6, %zmm1 + vfmadd213pd {rn-sae}, %zmm5, %zmm1, %zmm6 + vmovups L2+__svml_dlog10_data_internal_avx512(%rip), %zmm1 + vfmadd213pd {rn-sae}, %zmm6, %zmm1, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx 
zmm0 zmm7 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm7, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 
0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call log10@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_log10_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dlog10_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl[16][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 C075[8][2]; + __declspec(align(64)) VUINT32 poly_coeff9[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + __declspec(align(64)) VUINT32 L2[8][2]; + } __svml_dlog10_data_internal_avx512; +#endif +__svml_dlog10_data_internal_avx512: + /*== Log_tbl ==*/ + .quad 0x0000000000000000 + .quad 0xbf9af5f92b00e610 + .quad 0xbfaa30a9d609efea + .quad 0xbfb31b3055c47118 + .quad 0xbfb8cf183886480d + .quad 0xbfbe3bc1ab0e19fe + .quad 0xbfc1b3e71ec94f7b + 
.quad 0xbfc42c7e7fe3fc02 + .quad 0x3fbffbfc2bbc7803 + .quad 0x3fbb721cd17157e3 + .quad 0x3fb715d0ce367afc + .quad 0x3fb2e3a740b7800f + .quad 0x3fadb11ed766abf4 + .quad 0x3fa5e3966b7e9295 + .quad 0x3f9cb38fccd8bfdb + .quad 0x3f8c3d0837784c41 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== 0.75 ==*/ + .align 64 + .quad 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000 + /*== poly_coeff9 ==*/ + .align 64 + .quad 0x3fa8c2d828480370, 0x3fa8c2d828480370, 0x3fa8c2d828480370, 0x3fa8c2d828480370, 0x3fa8c2d828480370, 0x3fa8c2d828480370, 0x3fa8c2d828480370, 0x3fa8c2d828480370 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0xbfabd80d96029814, 0xbfabd80d96029814, 0xbfabd80d96029814, 0xbfabd80d96029814, 0xbfabd80d96029814, 0xbfabd80d96029814, 0xbfabd80d96029814, 0xbfabd80d96029814 + /*== poly_coeff7 ==*/ + .align 64 + .quad 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2, 0x3fafc3f6f38b58a2 + /*== poly_coeff6 ==*/ + .align 64 + .quad 0xbfb287a63464dc80, 0xbfb287a63464dc80, 0xbfb287a63464dc80, 0xbfb287a63464dc80, 0xbfb287a63464dc80, 0xbfb287a63464dc80, 0xbfb287a63464dc80, 0xbfb287a63464dc80 + /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3fb63c62777f27d9, 0x3fb63c62777f27d9, 0x3fb63c62777f27d9, 0x3fb63c62777f27d9, 0x3fb63c62777f27d9, 0x3fb63c62777f27d9, 0x3fb63c62777f27d9, 0x3fb63c62777f27d9 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3, 0xbfbbcb7b153c06a3 + /*== poly_coeff3 ==*/ + .align 64 + .quad 0x3fc287a7636f428c, 0x3fc287a7636f428c, 0x3fc287a7636f428c, 0x3fc287a7636f428c, 0x3fc287a7636f428c, 
0x3fc287a7636f428c, 0x3fc287a7636f428c, 0x3fc287a7636f428c + /*== poly_coeff2 ==*/ + .align 64 + .quad 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db, 0xbfcbcb7b1526e4db + /*== poly_coeff1 ==*/ + .align 64 + .quad 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e, 0x3fdbcb7b1526e50e + /*== L2 ==*/ + .align 64 + .quad 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff, 0x3fd34413509f79ff + .align 64 + .type __svml_dlog10_data_internal_avx512,@object + .size __svml_dlog10_data_internal_avx512,.-__svml_dlog10_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core-avx2.S new file mode 100644 index 0000000000..e389e2eca1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized log10f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define _ZGVeN16v_log10f _ZGVeN16v_log10f_avx2_wrapper +#include "../svml_s_log10f16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core.c new file mode 100644 index 0000000000..274fc7e0ff --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log10f, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_log10f +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_log10f, __GI__ZGVeN16v_log10f, + __redirect__ZGVeN16v_log10f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core_avx512.S new file mode 100644 index 0000000000..3dffd662ab --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f16_core_avx512.S @@ -0,0 +1,238 @@ +/* Function log10f vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log10(x) = k*log10(2.0) - log10(Rcp) + poly_approximation(R) + * log10(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog10_data_internal_avx512 + */ +#define One 0 +#define coeff4 64 +#define coeff3 128 +#define coeff2 192 +#define coeff1 256 +#define L2 320 + +#include + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_log10f_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vgetmantps $11, {sae}, %zmm0, %zmm3 + vmovups __svml_slog10_data_internal_avx512(%rip), %zmm1 + vgetexpps {sae}, %zmm0, %zmm5 + vmovups L2+__svml_slog10_data_internal_avx512(%rip), %zmm10 + vpsrld $19, %zmm3, %zmm7 + vgetexpps {sae}, %zmm3, %zmm6 + vsubps {rn-sae}, %zmm1, %zmm3, %zmm11 + vpermps coeff4+__svml_slog10_data_internal_avx512(%rip), %zmm7, %zmm1 + vpermps coeff3+__svml_slog10_data_internal_avx512(%rip), %zmm7, %zmm2 + vsubps {rn-sae}, %zmm6, %zmm5, %zmm9 + vpermps coeff2+__svml_slog10_data_internal_avx512(%rip), %zmm7, %zmm4 + vpermps coeff1+__svml_slog10_data_internal_avx512(%rip), %zmm7, %zmm8 + +/* x<=0?
*/ + vfpclassps $94, %zmm0, %k0 + vfmadd213ps {rn-sae}, %zmm2, %zmm11, %zmm1 + vmulps {rn-sae}, %zmm10, %zmm9, %zmm12 + vfmadd213ps {rn-sae}, %zmm4, %zmm11, %zmm1 + kmovw %k0, %edx + vfmadd213ps {rn-sae}, %zmm8, %zmm11, %zmm1 + vfmadd213ps {rn-sae}, %zmm12, %zmm11, %zmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm1, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups %zmm1, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in 
range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call log10f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_log10f_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_slog10_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 coeff4[16][1]; + __declspec(align(64)) VUINT32 coeff3[16][1]; + __declspec(align(64)) VUINT32 coeff2[16][1]; + __declspec(align(64)) VUINT32 coeff1[16][1]; + __declspec(align(64)) VUINT32 L2[16][1]; + } __svml_slog10_data_internal_avx512; +#endif +__svml_slog10_data_internal_avx512: + /*== One ==*/ + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 
0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + // c4 + .align 64 + .long 0xbdc9ae9b, 0xbda6fcf4 + .long 0xbd8bac76, 0xbd6bca30 + .long 0xbd48a99b, 0xbd2c0a9f + .long 0xbd1480db, 0xbd00faf2 + .long 0xbe823aa9, 0xbe656348 + .long 0xbe4afbb9, 0xbe346895 + .long 0xbe20ffff, 0xbe103a0b + .long 0xbe01a91c, 0xbde9e84e + // c3 + .align 64 + .long 0x3e13d888, 0x3e10a87c + .long 0x3e0b95c3, 0x3e057f0b + .long 0x3dfde038, 0x3df080d9 + .long 0x3de34c1e, 0x3dd68333 + .long 0x3dac6e8e, 0x3dd54a51 + .long 0x3df30f40, 0x3e04235d + .long 0x3e0b7033, 0x3e102c90 + .long 0x3e12ebad, 0x3e141ff8 + // c2 + .align 64 + .long 0xbe5e5a9b, 0xbe5e2677 + .long 0xbe5d83f5, 0xbe5c6016 + .long 0xbe5abd0b, 0xbe58a6fd + .long 0xbe562e02, 0xbe5362f8 + .long 0xbe68e27c, 0xbe646747 + .long 0xbe619a73, 0xbe5ff05a + .long 0xbe5f0570, 0xbe5e92d0 + .long 0xbe5e662b, 0xbe5e5c08 + // c1 + .align 64 + .long 0x3ede5bd8, 0x3ede5b45 + .long 0x3ede57d8, 0x3ede4eb1 + .long 0x3ede3d37, 0x3ede2166 + .long 0x3eddf9d9, 0x3eddc5bb + .long 0x3ede08ed, 0x3ede32e7 + .long 0x3ede4967, 0x3ede5490 + .long 0x3ede597f, 0x3ede5b50 + .long 0x3ede5bca, 0x3ede5bd9 + /*== L2 ==*/ + .align 64 + .long 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b + .align 64 + .type __svml_slog10_data_internal_avx512,@object + .size __svml_slog10_data_internal_avx512,.-__svml_slog10_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core-sse2.S new file mode 100644 index 0000000000..bb1cdee37e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized log10f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_log10f _ZGVbN4v_log10f_sse2 +#include "../svml_s_log10f4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core.c new file mode 100644 index 0000000000..67e9e71a76 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log10f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVbN4v_log10f +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_log10f, __GI__ZGVbN4v_log10f, + __redirect__ZGVbN4v_log10f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core_sse4.S new file mode 100644 index 0000000000..88b3535d5c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f4_core_sse4.S @@ -0,0 +1,243 @@ +/* Function log10f vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log10(x) = k*log10(2.0) - log10(Rcp) + poly_approximation(R) + * log10(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog10_data_internal + */ +#define MinNorm 0 +#define MaxNorm 16 +#define L2H 32 +#define L2L 48 +#define iBrkValue 64 +#define iOffExpoMask 80 +#define One 96 +#define sPoly 112 +#define L2 256 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_log10f_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm1 + +/* reduction: compute r,n */ + movdqu iBrkValue+__svml_slog10_data_internal(%rip), %xmm2 + movaps %xmm0, %xmm4 + movdqu iOffExpoMask+__svml_slog10_data_internal(%rip), %xmm10 + psubd %xmm2, %xmm1 + pand %xmm1, %xmm10 + psrad $23, %xmm1 + paddd %xmm2, %xmm10 + movaps %xmm0, %xmm3 + movups sPoly+__svml_slog10_data_internal(%rip), %xmm5 + movups sPoly+32+__svml_slog10_data_internal(%rip), %xmm6 + movups sPoly+64+__svml_slog10_data_internal(%rip), %xmm7 + movups sPoly+96+__svml_slog10_data_internal(%rip), %xmm9 + cvtdq2ps %xmm1, %xmm12 + cmpltps MinNorm+__svml_slog10_data_internal(%rip), %xmm4 + cmpnleps MaxNorm+__svml_slog10_data_internal(%rip), %xmm3 + subps One+__svml_slog10_data_internal(%rip), %xmm10 + mulps %xmm10, %xmm5 + movaps %xmm10, %xmm8 + mulps %xmm10, %xmm6 + mulps %xmm10, %xmm8 + addps sPoly+16+__svml_slog10_data_internal(%rip), %xmm5 + mulps %xmm10, %xmm7 + addps sPoly+48+__svml_slog10_data_internal(%rip), %xmm6 + mulps %xmm10, %xmm9 + mulps %xmm8, %xmm5 + addps sPoly+80+__svml_slog10_data_internal(%rip), %xmm7 + addps sPoly+112+__svml_slog10_data_internal(%rip), %xmm9 + addps %xmm5, %xmm6 + mulps %xmm8, %xmm6 + orps %xmm3, %xmm4 + +/* combine and get argument value range mask */ + movmskps %xmm4, %edx + movups L2L+__svml_slog10_data_internal(%rip), %xmm1 + addps %xmm6, %xmm7 + mulps %xmm12, %xmm1 + mulps %xmm7, %xmm8 + movups 
L2H+__svml_slog10_data_internal(%rip), %xmm11 + addps %xmm8, %xmm9 + mulps %xmm11, %xmm12 + mulps %xmm10, %xmm9 + addps sPoly+128+__svml_slog10_data_internal(%rip), %xmm9 + mulps %xmm9, %xmm10 + addps %xmm10, %xmm1 + addps %xmm12, %xmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm1, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm1 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call log10f@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_log10f_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef 
__svml_slog10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 MinNorm[4][1]; + __declspec(align(16)) VUINT32 MaxNorm[4][1]; + __declspec(align(16)) VUINT32 L2H[4][1]; + __declspec(align(16)) VUINT32 L2L[4][1]; + __declspec(align(16)) VUINT32 iBrkValue[4][1]; + __declspec(align(16)) VUINT32 iOffExpoMask[4][1]; + __declspec(align(16)) VUINT32 One[4][1]; + __declspec(align(16)) VUINT32 sPoly[9][4][1]; + __declspec(align(16)) VUINT32 L2[4][1]; +} __svml_slog10_data_internal; +#endif +__svml_slog10_data_internal: + /*== MinNorm ==*/ + .long 0x00800000, 0x00800000, 0x00800000, 0x00800000 + /*== MaxNorm ==*/ + .align 16 + .long 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff + /*== L2H ==*/ + .align 16 + .long 0x3e9a2100, 0x3e9a2100, 0x3e9a2100, 0x3e9a2100 + /*== L2L ==*/ + .align 16 + .long 0xb64AF600, 0xb64AF600, 0xb64AF600, 0xb64AF600 + /*== iBrkValue = SP 2/3 ==*/ + .align 16 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sOne = SP 1.0 ==*/ + .align 16 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== spoly[9] ==*/ + .align 16 + .long 0x3d8063B4, 0x3d8063B4, 0x3d8063B4, 0x3d8063B4 /* coeff9 */ + .long 0xbd890073, 0xbd890073, 0xbd890073, 0xbd890073 /* coeff8 */ + .long 0x3d775317, 0x3d775317, 0x3d775317, 0x3d775317 /* coeff7 */ + .long 0xbd91FB27, 0xbd91FB27, 0xbd91FB27, 0xbd91FB27 /* coeff6 */ + .long 0x3dB20B96, 0x3dB20B96, 0x3dB20B96, 0x3dB20B96 /* coeff5 */ + .long 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20 /* coeff4 */ + .long 0x3e143CE5, 0x3e143CE5, 0x3e143CE5, 0x3e143CE5 /* coeff3 */ + .long 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5 /* coeff2 */ + .long 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9 /* coeff1 */ + /*== L2 ==*/ + .align 16 + .long 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b + .align 16 + .type __svml_slog10_data_internal,@object + .size 
__svml_slog10_data_internal,.-__svml_slog10_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core-sse.S new file mode 100644 index 0000000000..e3467e5c90 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized log10f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_log10f _ZGVdN8v_log10f_sse_wrapper +#include "../svml_s_log10f8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core.c new file mode 100644 index 0000000000..bfd3ef6554 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log10f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN8v_log10f +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_log10f, __GI__ZGVdN8v_log10f, + __redirect__ZGVdN8v_log10f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core_avx2.S new file mode 100644 index 0000000000..58e26342e7 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log10f8_core_avx2.S @@ -0,0 +1,243 @@ +/* Function log10f vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log10(x) = k*log10(2.0) - log10(Rcp) + poly_approximation(R) + * log10(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog10_data_internal + */ +#define MinNorm 0 +#define MaxNorm 32 +#define L2H 64 +#define L2L 96 +#define iBrkValue 128 +#define iOffExpoMask 160 +#define One 192 +#define sPoly 224 +#define L2 512 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_log10f_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* reduction: compute r,n */ + vmovups iBrkValue+__svml_slog10_data_internal(%rip), %ymm4 + vmovups sPoly+__svml_slog10_data_internal(%rip), %ymm15 + vmovups sPoly+64+__svml_slog10_data_internal(%rip), %ymm9 + vmovups sPoly+128+__svml_slog10_data_internal(%rip), %ymm10 + vmovups sPoly+192+__svml_slog10_data_internal(%rip), %ymm12 + vpsubd %ymm4, %ymm0, %ymm1 + vcmplt_oqps MinNorm+__svml_slog10_data_internal(%rip), %ymm0, %ymm5 + vcmpnle_uqps MaxNorm+__svml_slog10_data_internal(%rip), %ymm0, %ymm6 + vpand iOffExpoMask+__svml_slog10_data_internal(%rip), %ymm1, %ymm3 + vpsrad $23, %ymm1, %ymm2 + vpaddd %ymm4, %ymm3, %ymm8 + vcvtdq2ps %ymm2, %ymm1 + vsubps One+__svml_slog10_data_internal(%rip), %ymm8, %ymm13 + vmulps L2L+__svml_slog10_data_internal(%rip), %ymm1, %ymm14 + vfmadd213ps sPoly+32+__svml_slog10_data_internal(%rip), %ymm13, %ymm15 + vfmadd213ps sPoly+96+__svml_slog10_data_internal(%rip), %ymm13, %ymm9 + vmulps %ymm13, %ymm13, %ymm11 + vfmadd213ps sPoly+160+__svml_slog10_data_internal(%rip), %ymm13, %ymm10 + vfmadd213ps sPoly+224+__svml_slog10_data_internal(%rip), %ymm13, %ymm12 + vfmadd213ps %ymm9, %ymm11, %ymm15 + vfmadd213ps %ymm10, %ymm11, %ymm15 + vfmadd213ps %ymm12, %ymm11, %ymm15 + vfmadd213ps sPoly+256+__svml_slog10_data_internal(%rip), %ymm13, %ymm15 + vfmadd213ps %ymm14, %ymm13, 
%ymm15 + vorps %ymm6, %ymm5, %ymm7 + +/* combine and get argument value range mask */ + vmovmskps %ymm7, %edx + vfmadd132ps L2H+__svml_slog10_data_internal(%rip), %ymm15, %ymm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %ymm1, %ymm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm0, 32(%rsp) + vmovups %ymm1, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + 
cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call log10f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_log10f_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_slog10_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 MinNorm[8][1]; + __declspec(align(32)) VUINT32 MaxNorm[8][1]; + __declspec(align(32)) VUINT32 L2H[8][1]; + __declspec(align(32)) VUINT32 L2L[8][1]; + __declspec(align(32)) VUINT32 iBrkValue[8][1]; + __declspec(align(32)) VUINT32 iOffExpoMask[8][1]; + __declspec(align(32)) VUINT32 One[8][1]; + __declspec(align(32)) VUINT32 sPoly[9][8][1]; + __declspec(align(32)) VUINT32 L2[8][1]; +} __svml_slog10_data_internal; +#endif +__svml_slog10_data_internal: + /*== MinNorm ==*/ + .long 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000 + 
/*== MaxNorm ==*/ + .align 32 + .long 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff + /*== L2H ==*/ + .align 32 + .long 0x3e9a2100, 0x3e9a2100, 0x3e9a2100, 0x3e9a2100, 0x3e9a2100, 0x3e9a2100, 0x3e9a2100, 0x3e9a2100 + /*== L2L ==*/ + .align 32 + .long 0xb64AF600, 0xb64AF600, 0xb64AF600, 0xb64AF600, 0xb64AF600, 0xb64AF600, 0xb64AF600, 0xb64AF600 + /*== iBrkValue = SP 2/3 ==*/ + .align 32 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 32 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sOne = SP 1.0 ==*/ + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== spoly[9] ==*/ + .align 32 + .long 0x3d8063B4, 0x3d8063B4, 0x3d8063B4, 0x3d8063B4, 0x3d8063B4, 0x3d8063B4, 0x3d8063B4, 0x3d8063B4 /* coeff9 */ + .long 0xbd890073, 0xbd890073, 0xbd890073, 0xbd890073, 0xbd890073, 0xbd890073, 0xbd890073, 0xbd890073 /* coeff8 */ + .long 0x3d775317, 0x3d775317, 0x3d775317, 0x3d775317, 0x3d775317, 0x3d775317, 0x3d775317, 0x3d775317 /* coeff7 */ + .long 0xbd91FB27, 0xbd91FB27, 0xbd91FB27, 0xbd91FB27, 0xbd91FB27, 0xbd91FB27, 0xbd91FB27, 0xbd91FB27 /* coeff6 */ + .long 0x3dB20B96, 0x3dB20B96, 0x3dB20B96, 0x3dB20B96, 0x3dB20B96, 0x3dB20B96, 0x3dB20B96, 0x3dB20B96 /* coeff5 */ + .long 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20, 0xbdDE6E20 /* coeff4 */ + .long 0x3e143CE5, 0x3e143CE5, 0x3e143CE5, 0x3e143CE5, 0x3e143CE5, 0x3e143CE5, 0x3e143CE5, 0x3e143CE5 /* coeff3 */ + .long 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5, 0xbe5E5BC5 /* coeff2 */ + .long 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9, 0x3eDE5BD9 /* coeff1 */ + /*== L2 ==*/ + .align 32 + .long 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 
0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b, 0x3e9a209b + .align 32 + .type __svml_slog10_data_internal,@object + .size __svml_slog10_data_internal,.-__svml_slog10_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_log102_core.S b/sysdeps/x86_64/fpu/svml_d_log102_core.S new file mode 100644 index 0000000000..3d0c058ac2 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log102_core.S @@ -0,0 +1,29 @@ +/* Function log10 vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_log10) +WRAPPER_IMPL_SSE2 log10 +END (_ZGVbN2v_log10) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_log10) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_log104_core.S b/sysdeps/x86_64/fpu/svml_d_log104_core.S new file mode 100644 index 0000000000..9e32c62c0e --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log104_core.S @@ -0,0 +1,29 @@ +/* Function log10 vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_log10) +WRAPPER_IMPL_AVX _ZGVbN2v_log10 +END (_ZGVdN4v_log10) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_log10) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_log104_core_avx.S b/sysdeps/x86_64/fpu/svml_d_log104_core_avx.S new file mode 100644 index 0000000000..2b073b16f9 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log104_core_avx.S @@ -0,0 +1,25 @@ +/* Function log10 vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_log10) +WRAPPER_IMPL_AVX _ZGVbN2v_log10 +END (_ZGVcN4v_log10) diff --git a/sysdeps/x86_64/fpu/svml_d_log108_core.S b/sysdeps/x86_64/fpu/svml_d_log108_core.S new file mode 100644 index 0000000000..853d791f2d --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log108_core.S @@ -0,0 +1,25 @@ +/* Function log10 vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_log10) +WRAPPER_IMPL_AVX512 _ZGVdN4v_log10 +END (_ZGVeN8v_log10) diff --git a/sysdeps/x86_64/fpu/svml_s_log10f16_core.S b/sysdeps/x86_64/fpu/svml_s_log10f16_core.S new file mode 100644 index 0000000000..769603c92d --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log10f16_core.S @@ -0,0 +1,25 @@ +/* Function log10f vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_log10f) +WRAPPER_IMPL_AVX512 _ZGVdN8v_log10f +END (_ZGVeN16v_log10f) diff --git a/sysdeps/x86_64/fpu/svml_s_log10f4_core.S b/sysdeps/x86_64/fpu/svml_s_log10f4_core.S new file mode 100644 index 0000000000..523525409b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log10f4_core.S @@ -0,0 +1,29 @@ +/* Function log10f vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_log10f) +WRAPPER_IMPL_SSE2 log10f +END (_ZGVbN4v_log10f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_log10f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_log10f8_core.S b/sysdeps/x86_64/fpu/svml_s_log10f8_core.S new file mode 100644 index 0000000000..630ec76b7f --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log10f8_core.S @@ -0,0 +1,29 @@ +/* Function log10f vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_log10f) +WRAPPER_IMPL_AVX _ZGVbN4v_log10f +END (_ZGVdN8v_log10f) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_log10f) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_log10f8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_log10f8_core_avx.S new file mode 100644 index 0000000000..374208cb2c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log10f8_core_avx.S @@ -0,0 +1,25 @@ +/* Function log10f vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_log10f) +WRAPPER_IMPL_AVX _ZGVbN4v_log10f +END (_ZGVcN8v_log10f) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx.c new file mode 100644 index 0000000000..770fd725e0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-log10.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx2.c new file mode 100644 index 0000000000..770fd725e0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-log10.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx512f.c new file mode 100644 index 0000000000..770fd725e0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log10-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-log10.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log10.c b/sysdeps/x86_64/fpu/test-double-libmvec-log10.c new file mode 100644 index 0000000000..cb1ab36819 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log10.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC log10 +#include 
"test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 37a7a1c777..3dce136dfc 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVbN2v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVbN2v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVbN2v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVbN2vv_atan2) +VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVbN2v_log10) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 4313f67e06..1852625897 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVdN4v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVdN4v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVdN4v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVdN4vv_atan2) +VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVdN4v_log10) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 4b8b00f16d..cf9ea35ffe 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVcN4v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVcN4v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVcN4v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVcN4vv_atan2) +VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVcN4v_log10) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index d06522a407..b6457ea032 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ 
b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1), _ZGVeN8v_expm1) VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVeN8v_sinh) VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVeN8v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVeN8vv_atan2) +VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVeN8v_log10) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx.c new file mode 100644 index 0000000000..04f017f1e2 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-log10f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx2.c new file mode 100644 index 0000000000..04f017f1e2 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-log10f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx512f.c new file mode 100644 index 0000000000..04f017f1e2 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log10f-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-log10f.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log10f.c b/sysdeps/x86_64/fpu/test-float-libmvec-log10f.c new file mode 100644 index 0000000000..682ce1e239 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log10f.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC log10f +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 0bd631bf9a..272e754e1b 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVeN16v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVeN16v_sinhf) 
VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVeN16v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVeN16vv_atan2f) +VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVeN16v_log10f) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 1018398bd3..b892258b99 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVbN4v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVbN4v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVbN4v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVbN4vv_atan2f) +VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVbN4v_log10f) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 42ea28f30f..1c6ead71e1 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVdN8v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVdN8v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVdN8v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVdN8vv_atan2f) +VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVdN8v_log10f) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. 
*/ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 70a0216a07..71f5d8d7b6 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -38,6 +38,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (expm1f), _ZGVcN8v_expm1f) VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVcN8v_sinhf) VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVcN8v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVcN8vv_atan2f) +VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVcN8v_log10f) #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:24 2021
X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573827
From: Sunil Pandey
Reply-To: Sunil K Pandey
To: libc-alpha@sourceware.org
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Subject: [PATCH v4 12/18] x86-64: Add vector log2/log2f implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:24 -0800
Message-Id: <20211228201130.737370-13-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>

Implement vectorized log2/log2f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector log2/log2f with regenerated ulps. --- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_log22_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_log22_core.c | 27 + .../fpu/multiarch/svml_d_log22_core_sse4.S | 1336 +++++++++++++++++ .../fpu/multiarch/svml_d_log24_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_log24_core.c | 27 + .../fpu/multiarch/svml_d_log24_core_avx2.S | 1321 ++++++++++++++++ .../fpu/multiarch/svml_d_log28_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_log28_core.c | 27 + .../fpu/multiarch/svml_d_log28_core_avx512.S | 293 ++++ .../fpu/multiarch/svml_s_log2f16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_log2f16_core.c | 28 + .../multiarch/svml_s_log2f16_core_avx512.S | 231 +++ .../fpu/multiarch/svml_s_log2f4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_log2f4_core.c | 28 + .../fpu/multiarch/svml_s_log2f4_core_sse4.S |
223 +++ .../fpu/multiarch/svml_s_log2f8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_log2f8_core.c | 28 + .../fpu/multiarch/svml_s_log2f8_core_avx2.S | 226 +++ sysdeps/x86_64/fpu/svml_d_log22_core.S | 29 + sysdeps/x86_64/fpu/svml_d_log24_core.S | 29 + sysdeps/x86_64/fpu/svml_d_log24_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_log28_core.S | 25 + sysdeps/x86_64/fpu/svml_s_log2f16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_log2f4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_log2f8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_log2f8_core_avx.S | 25 + .../x86_64/fpu/test-double-libmvec-log2-avx.c | 1 + .../fpu/test-double-libmvec-log2-avx2.c | 1 + .../fpu/test-double-libmvec-log2-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-log2.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-log2f-avx.c | 1 + .../fpu/test-float-libmvec-log2f-avx2.c | 1 + .../fpu/test-float-libmvec-log2f-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-log2f.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 4202 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log22_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log22_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log22_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log24_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log24_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log24_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log28_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log28_core.c create mode 100644 
sysdeps/x86_64/fpu/multiarch/svml_d_log28_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log22_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log24_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log24_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log28_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log2f16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log2f4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log2f8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log2f8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log2-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log2-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log2-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log2f.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 4ad584c227..73252615ca 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -230,4 +230,15 @@ #define __DECL_SIMD_log10f32x #define __DECL_SIMD_log10f64x 
#define __DECL_SIMD_log10f128x + +#define __DECL_SIMD_log2 +#define __DECL_SIMD_log2f +#define __DECL_SIMD_log2l +#define __DECL_SIMD_log2f16 +#define __DECL_SIMD_log2f32 +#define __DECL_SIMD_log2f64 +#define __DECL_SIMD_log2f128 +#define __DECL_SIMD_log2f32x +#define __DECL_SIMD_log2f64x +#define __DECL_SIMD_log2f128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index f21384758a..bfe52a4666 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -130,7 +130,7 @@ __MATHCALL (logb,, (_Mdouble_ __x)); __MATHCALL_VEC (exp2,, (_Mdouble_ __x)); /* Compute base-2 logarithm of X. */ -__MATHCALL (log2,, (_Mdouble_ __x)); +__MATHCALL_VEC (log2,, (_Mdouble_ __x)); #endif diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 8108a2a189..fa8b016c5d 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -55,6 +55,7 @@ GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2v_expm1 F GLIBC_2.35 _ZGVbN2v_log10 F +GLIBC_2.35 _ZGVbN2v_log2 F GLIBC_2.35 _ZGVbN2v_sinh F GLIBC_2.35 _ZGVbN2vv_atan2 F GLIBC_2.35 _ZGVbN2vv_hypot F @@ -67,6 +68,7 @@ GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4v_expm1f F GLIBC_2.35 _ZGVbN4v_log10f F +GLIBC_2.35 _ZGVbN4v_log2f F GLIBC_2.35 _ZGVbN4v_sinhf F GLIBC_2.35 _ZGVbN4vv_atan2f F GLIBC_2.35 _ZGVbN4vv_hypotf F @@ -79,6 +81,7 @@ GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4v_expm1 F GLIBC_2.35 _ZGVcN4v_log10 F +GLIBC_2.35 _ZGVcN4v_log2 F GLIBC_2.35 _ZGVcN4v_sinh F GLIBC_2.35 _ZGVcN4vv_atan2 F GLIBC_2.35 _ZGVcN4vv_hypot F @@ -91,6 +94,7 @@ GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8v_expm1f F GLIBC_2.35 _ZGVcN8v_log10f F +GLIBC_2.35 _ZGVcN8v_log2f F GLIBC_2.35 _ZGVcN8v_sinhf F GLIBC_2.35 _ZGVcN8vv_atan2f F GLIBC_2.35 _ZGVcN8vv_hypotf F @@ -103,6 +107,7 @@ GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 
_ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4v_expm1 F GLIBC_2.35 _ZGVdN4v_log10 F +GLIBC_2.35 _ZGVdN4v_log2 F GLIBC_2.35 _ZGVdN4v_sinh F GLIBC_2.35 _ZGVdN4vv_atan2 F GLIBC_2.35 _ZGVdN4vv_hypot F @@ -115,6 +120,7 @@ GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8v_expm1f F GLIBC_2.35 _ZGVdN8v_log10f F +GLIBC_2.35 _ZGVdN8v_log2f F GLIBC_2.35 _ZGVdN8v_sinhf F GLIBC_2.35 _ZGVdN8vv_atan2f F GLIBC_2.35 _ZGVdN8vv_hypotf F @@ -127,6 +133,7 @@ GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16v_expm1f F GLIBC_2.35 _ZGVeN16v_log10f F +GLIBC_2.35 _ZGVeN16v_log2f F GLIBC_2.35 _ZGVeN16v_sinhf F GLIBC_2.35 _ZGVeN16vv_atan2f F GLIBC_2.35 _ZGVeN16vv_hypotf F @@ -139,6 +146,7 @@ GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8v_expm1 F GLIBC_2.35 _ZGVeN8v_log10 F +GLIBC_2.35 _ZGVeN8v_log2 F GLIBC_2.35 _ZGVeN8v_sinh F GLIBC_2.35 _ZGVeN8vv_atan2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 64e80ada7a..59d284a10a 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -106,6 +106,10 @@ # define __DECL_SIMD_log10 __DECL_SIMD_x86_64 # undef __DECL_SIMD_log10f # define __DECL_SIMD_log10f __DECL_SIMD_x86_64 +# undef __DECL_SIMD_log2 +# define __DECL_SIMD_log2 __DECL_SIMD_x86_64 +# undef __DECL_SIMD_log2f +# define __DECL_SIMD_log2f __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index f5050c68af..a2ca9a203f 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -52,6 +52,8 @@ !GCC$ builtin (atan2f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (log10) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (log10f) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (log2) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (log2f) 
attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -89,3 +91,5 @@ !GCC$ builtin (atan2f) attributes simd (notinbranch) if('x32') !GCC$ builtin (log10) attributes simd (notinbranch) if('x32') !GCC$ builtin (log10f) attributes simd (notinbranch) if('x32') +!GCC$ builtin (log2) attributes simd (notinbranch) if('x32') +!GCC$ builtin (log2f) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index ba37044e9d..8d6d0915af 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -36,6 +36,7 @@ libmvec-funcs = \ hypot \ log \ log10 \ + log2 \ pow \ sin \ sincos \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 8beaf0736f..1b48c2d642 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -23,6 +23,7 @@ libmvec { _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; _ZGVbN2v_log10; _ZGVcN4v_log10; _ZGVdN4v_log10; _ZGVeN8v_log10; + _ZGVbN2v_log2; _ZGVcN4v_log2; _ZGVdN4v_log2; _ZGVeN8v_log2; _ZGVbN2v_sinh; _ZGVcN4v_sinh; _ZGVdN4v_sinh; _ZGVeN8v_sinh; _ZGVbN2vv_atan2; _ZGVcN4vv_atan2; _ZGVdN4vv_atan2; _ZGVeN8vv_atan2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; @@ -35,6 +36,7 @@ libmvec { _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; _ZGVbN4v_log10f; _ZGVcN8v_log10f; _ZGVdN8v_log10f; _ZGVeN16v_log10f; + _ZGVbN4v_log2f; _ZGVcN8v_log2f; _ZGVdN8v_log2f; _ZGVeN16v_log2f; _ZGVbN4v_sinhf; _ZGVcN8v_sinhf; _ZGVdN8v_sinhf; _ZGVeN16v_sinhf; _ZGVbN4vv_atan2f; _ZGVcN8vv_atan2f; _ZGVdN8vv_atan2f; _ZGVeN16vv_atan2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps 
b/sysdeps/x86_64/fpu/libm-test-ulps index b0cd9d60ea..3b7f3cee6f 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1709,6 +1709,26 @@ float: 3 float128: 1 ldouble: 1 +Function: "log2_vlen16": +float: 1 + +Function: "log2_vlen2": +double: 1 + +Function: "log2_vlen4": +double: 1 +float: 1 + +Function: "log2_vlen4_avx2": +double: 1 + +Function: "log2_vlen8": +double: 1 +float: 1 + +Function: "log2_vlen8_avx2": +float: 1 + Function: "log_downward": float: 2 float128: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core-sse2.S new file mode 100644 index 0000000000..e0833a174b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized log2, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_log2 _ZGVbN2v_log2_sse2 +#include "../svml_d_log22_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core.c new file mode 100644 index 0000000000..6d0b5a03ca --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log2, vector length is 2. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN2v_log2 +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_log2, __GI__ZGVbN2v_log2, __redirect__ZGVbN2v_log2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core_sse4.S new file mode 100644 index 0000000000..6d2d6e396c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log22_core_sse4.S @@ -0,0 +1,1336 @@ +/* Function log2 vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log2(x) = k - log2(Rcp) + poly_approximation(R) + * log2(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog2_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8208 +#define poly_coeff 12320 +#define ExpMask 12400 +#define Two10 12416 +#define MinNorm 12432 +#define MaxNorm 12448 +#define HalfMask 12464 +#define One 12480 +#define Threshold 12496 +#define Bias 12512 +#define Bias1 12528 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_log2_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + +/* exponent bits */ + movaps %xmm0, %xmm5 + +/* preserve mantissa, set input exponent to 2^(-10) */ + movups ExpMask+__svml_dlog2_data_internal(%rip), %xmm1 + psrlq $20, %xmm5 + andps %xmm0, %xmm1 + lea -4218864+__svml_dlog2_data_internal(%rip), %rsi + orps Two10+__svml_dlog2_data_internal(%rip), %xmm1 + +/* check range */ + movaps %xmm0, %xmm8 + +/* reciprocal approximation good to at least 11 bits */ + cvtpd2ps %xmm1, %xmm2 + cmpltpd MinNorm+__svml_dlog2_data_internal(%rip), %xmm8 + movlhps %xmm2, %xmm2 + movaps %xmm0, %xmm7 + rcpps %xmm2, %xmm3 + cmpnlepd MaxNorm+__svml_dlog2_data_internal(%rip), %xmm7 + cvtps2pd %xmm3, %xmm12 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + movups .FLT_11(%rip), %xmm4 + orps %xmm7, %xmm8 + addpd %xmm4, %xmm12 + +/* combine and get argument value range mask */ + movmskpd %xmm8, %edx + +/* argument reduction */ + movups HalfMask+__svml_dlog2_data_internal(%rip), %xmm9 + subpd %xmm4, %xmm12 + andps %xmm1, %xmm9 + +/* + * prepare table index + * table lookup + */ + movaps %xmm12, %xmm10 + subpd 
%xmm9, %xmm1 + mulpd %xmm12, %xmm9 + mulpd %xmm12, %xmm1 + subpd One+__svml_dlog2_data_internal(%rip), %xmm9 + addpd %xmm9, %xmm1 + +/* polynomial */ + movups poly_coeff+__svml_dlog2_data_internal(%rip), %xmm14 + psrlq $40, %xmm10 + mulpd %xmm1, %xmm14 + movd %xmm10, %eax + pshufd $2, %xmm10, %xmm11 + movaps %xmm1, %xmm10 + movups poly_coeff+32+__svml_dlog2_data_internal(%rip), %xmm15 + mulpd %xmm1, %xmm10 + addpd poly_coeff+16+__svml_dlog2_data_internal(%rip), %xmm14 + mulpd %xmm1, %xmm15 + mulpd %xmm10, %xmm14 + addpd poly_coeff+48+__svml_dlog2_data_internal(%rip), %xmm15 + movd %xmm11, %ecx + movups poly_coeff+64+__svml_dlog2_data_internal(%rip), %xmm11 + addpd %xmm14, %xmm15 + mulpd %xmm1, %xmm11 + mulpd %xmm15, %xmm10 + +/* exponent */ + movups Threshold+__svml_dlog2_data_internal(%rip), %xmm13 + cmpltpd %xmm12, %xmm13 + addpd %xmm10, %xmm11 + pshufd $221, %xmm5, %xmm6 + +/* biased exponent in DP format */ + cvtdq2pd %xmm6, %xmm3 + movslq %eax, %rax + movslq %ecx, %rcx + andps Bias+__svml_dlog2_data_internal(%rip), %xmm13 + orps Bias1+__svml_dlog2_data_internal(%rip), %xmm13 + movsd (%rsi,%rax), %xmm2 + movhpd (%rsi,%rcx), %xmm2 + subpd %xmm13, %xmm3 + +/* reconstruction */ + addpd %xmm11, %xmm2 + addpd %xmm2, %xmm3 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm3, %xmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm3, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 
0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm3 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm3 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl
%r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call log2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_log2_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dlog2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(16)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(16)) VUINT32 poly_coeff[5][2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 Two10[2][2]; + __declspec(align(16)) VUINT32 MinNorm[2][2]; + __declspec(align(16)) VUINT32 MaxNorm[2][2]; + __declspec(align(16)) VUINT32 HalfMask[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 Bias[2][2]; + __declspec(align(16)) VUINT32 Bias1[2][2]; +} __svml_dlog2_data_internal; +#endif +__svml_dlog2_data_internal: + /* Log_HA_table */ + .quad 0xc08ff00000000000, 0x0000000000000000 + .quad 0xc08ff0040038c920, 0x3d52bfc81744e999 + .quad 0xc08ff007ff0f0190, 0xbd59b2cedc63c895 + .quad 0xc08ff00bfc839e88, 0xbd28e365e6741d71 + .quad 0xc08ff00ff8979428, 0x3d4027998f69a77d + .quad 0xc08ff013f34bd5a0, 0x3d5dd2cb33fe6a89 + .quad 0xc08ff017eca15518, 0xbd526514cdf2c019 + .quad 0xc08ff01be49903d8, 0xbd44bfeeba165e04 + .quad 0xc08ff01fdb33d218, 0xbd3fa79ee110cec3 + .quad 0xc08ff023d072af20, 0xbd4eebb642c7fd60 + .quad 0xc08ff027c4568948, 0x3d429b13d7093443 + .quad 0xc08ff02bb6e04de8, 0x3d50f346bd36551e + .quad 0xc08ff02fa810e968, 0xbd5020bb662f1536 + .quad 0xc08ff03397e94750, 0x3d5de76b56340995 + .quad 0xc08ff037866a5218, 0x3d58065ff3304090 + .quad 0xc08ff03b7394f360, 0x3d561fc9322fb785 + .quad 0xc08ff03f5f6a13d0, 0x3d0abecd17d0d778 + .quad 0xc08ff04349ea9b28, 0xbd588f3ad0ce4d44 + .quad 0xc08ff04733177040, 0xbd4454ba4ac5f44d + .quad 0xc08ff04b1af178f8, 0xbd556f78faaa0887 + 
.quad 0xc08ff04f01799a58, 0x3d49db8976de7469 + .quad 0xc08ff052e6b0b868, 0xbd5cdb6fce17ef00 + .quad 0xc08ff056ca97b668, 0xbd576de8c0412f09 + .quad 0xc08ff05aad2f76a0, 0x3d30142c7ec6475c + .quad 0xc08ff05e8e78da70, 0xbd1e685afc26de72 + .quad 0xc08ff0626e74c260, 0xbd40b64c954078a3 + .quad 0xc08ff0664d240e10, 0xbd5fcde393462d7d + .quad 0xc08ff06a2a879c48, 0xbd537245eeeecc53 + .quad 0xc08ff06e06a04ae8, 0x3d4ac306eb47b436 + .quad 0xc08ff071e16ef6e8, 0xbd5a1fd9d3758f6b + .quad 0xc08ff075baf47c80, 0x3d2401fbaaa67e3c + .quad 0xc08ff0799331b6f0, 0x3d4f8dbef47a4d53 + .quad 0xc08ff07d6a2780a8, 0x3d51215e0abb42d1 + .quad 0xc08ff0813fd6b340, 0x3d57ce6249eddb35 + .quad 0xc08ff08514402770, 0xbd38a803c7083a25 + .quad 0xc08ff088e764b528, 0x3d42218beba5073e + .quad 0xc08ff08cb9453370, 0x3d447b66f1c6248f + .quad 0xc08ff09089e27880, 0xbd53d9297847e995 + .quad 0xc08ff094593d59c8, 0xbd12b6979cc77aa9 + .quad 0xc08ff0982756abd0, 0xbd55308545ecd702 + .quad 0xc08ff09bf42f4260, 0xbd578fa97c3b936f + .quad 0xc08ff09fbfc7f068, 0xbd41828408ce869d + .quad 0xc08ff0a38a218808, 0x3d555da6ce7251a6 + .quad 0xc08ff0a7533cda88, 0xbd41f3cd14bfcb02 + .quad 0xc08ff0ab1b1ab878, 0xbd1f028da6bf1852 + .quad 0xc08ff0aee1bbf188, 0xbd4cf04de3267f54 + .quad 0xc08ff0b2a72154a8, 0xbd4556e47019db10 + .quad 0xc08ff0b66b4baff8, 0x3d1e7ba00b15fbe4 + .quad 0xc08ff0ba2e3bd0d0, 0x3d5bfde1c52c2f28 + .quad 0xc08ff0bdeff283b8, 0x3d48d63fe20ee5d6 + .quad 0xc08ff0c1b0709480, 0x3d57f551980838ff + .quad 0xc08ff0c56fb6ce20, 0xbd4189091f293c81 + .quad 0xc08ff0c92dc5fae0, 0x3d4d549f05f06169 + .quad 0xc08ff0ccea9ee428, 0xbd5982466074e1e3 + .quad 0xc08ff0d0a64252b8, 0xbd5d30a6b16c0e4b + .quad 0xc08ff0d460b10e80, 0xbd3138bf3b51a201 + .quad 0xc08ff0d819ebdea8, 0xbd454e680c0801d6 + .quad 0xc08ff0dbd1f389a8, 0x3d584db361385926 + .quad 0xc08ff0df88c8d520, 0xbd564f2252a82c03 + .quad 0xc08ff0e33e6c8610, 0xbd5c78c35ed5d034 + .quad 0xc08ff0e6f2df60a8, 0xbd52eb9f29ca3d75 + .quad 0xc08ff0eaa6222860, 0x3d5340c0c01b5ff8 + .quad 0xc08ff0ee58359fe8, 
0x3d10c2acaffa64b6 + .quad 0xc08ff0f2091a8948, 0xbd3fced311301ebe + .quad 0xc08ff0f5b8d1a5c8, 0x3d41ee5d591af30b + .quad 0xc08ff0f9675bb5f0, 0x3d4873546b0e668c + .quad 0xc08ff0fd14b97998, 0x3d5a99928177a119 + .quad 0xc08ff100c0ebafd8, 0x3d378ead132adcac + .quad 0xc08ff1046bf31720, 0x3d51a538bc597d48 + .quad 0xc08ff10815d06d18, 0xbd540ee2f35efd7e + .quad 0xc08ff10bbe846ec8, 0xbd59cf94753adacc + .quad 0xc08ff10f660fd878, 0xbd5201a3d6862895 + .quad 0xc08ff1130c7365c0, 0x3d383e25d0822d03 + .quad 0xc08ff116b1afd180, 0xbd0b7389bbea8f7b + .quad 0xc08ff11a55c5d5f0, 0xbd4df278087a6617 + .quad 0xc08ff11df8b62c98, 0xbd48daeb8ec01e26 + .quad 0xc08ff1219a818e50, 0x3d57c9312e0a14da + .quad 0xc08ff1253b28b330, 0xbd5f0fbc0e4d507e + .quad 0xc08ff128daac52c8, 0xbd222afdee008687 + .quad 0xc08ff12c790d23d8, 0x3d17c71747bcef8b + .quad 0xc08ff130164bdc88, 0x3d5d69cfd051af50 + .quad 0xc08ff133b2693248, 0x3d59dff064e9433a + .quad 0xc08ff1374d65d9e8, 0x3d4f71a30db3240b + .quad 0xc08ff13ae7428788, 0xbd5e56afa9524606 + .quad 0xc08ff13e7fffeeb0, 0xbd44acd84e6f8518 + .quad 0xc08ff142179ec228, 0xbd519845ade5e121 + .quad 0xc08ff145ae1fb420, 0xbd5b3b4a38ddec70 + .quad 0xc08ff14943837620, 0xbd5ea4bb5bc137c7 + .quad 0xc08ff14cd7cab910, 0x3d5610f3bf8eb6ce + .quad 0xc08ff1506af62d20, 0x3d57b1170d6184cf + .quad 0xc08ff153fd0681f0, 0x3d5791a688a3660e + .quad 0xc08ff1578dfc6678, 0x3d5d41ecf8abac2e + .quad 0xc08ff15b1dd88908, 0x3cf0bd995d64d573 + .quad 0xc08ff15eac9b9758, 0xbd5e3653cd796d01 + .quad 0xc08ff1623a463e80, 0xbd597573005ef2d8 + .quad 0xc08ff165c6d92af0, 0xbd4ee222d6439c41 + .quad 0xc08ff16952550880, 0x3d5913b845e75950 + .quad 0xc08ff16cdcba8258, 0xbd558e7ba239077e + .quad 0xc08ff170660a4328, 0x3d5a0e174a2cae66 + .quad 0xc08ff173ee44f4d8, 0x3d22b8db103db712 + .quad 0xc08ff177756b40d8, 0x3d5cc610480853c4 + .quad 0xc08ff17afb7dcfe0, 0xbd304a8bc84e5c0f + .quad 0xc08ff17e807d4a28, 0x3d3639d185da5f7d + .quad 0xc08ff182046a5738, 0xbd534705d06d788f + .quad 0xc08ff18587459e10, 0xbd540d25b28a51fd + 
.quad 0xc08ff189090fc510, 0xbd02d804afa7080a + .quad 0xc08ff18c89c97200, 0x3d5f2a5d305818ba + .quad 0xc08ff19009734a08, 0xbd3a602e9d05c3e4 + .quad 0xc08ff193880df1d0, 0xbd533d6fdcd54875 + .quad 0xc08ff197059a0d60, 0x3d24eaf0a9490202 + .quad 0xc08ff19a82184020, 0xbd5685666d98eb59 + .quad 0xc08ff19dfd892cf8, 0xbd509f8745f0868b + .quad 0xc08ff1a177ed7630, 0xbd2dcba340a9d268 + .quad 0xc08ff1a4f145bd80, 0x3d4916fcd0331266 + .quad 0xc08ff1a86992a408, 0xbd548cd033a49073 + .quad 0xc08ff1abe0d4ca68, 0xbd5252f40e5df1a2 + .quad 0xc08ff1af570cd0a0, 0xbd541d623bd02248 + .quad 0xc08ff1b2cc3b5628, 0xbd258dc48235c071 + .quad 0xc08ff1b64060f9e0, 0xbd4b4bd8f02ed3f2 + .quad 0xc08ff1b9b37e5a28, 0x3d4e8d20a88cd0a2 + .quad 0xc08ff1bd259414c0, 0x3d3b669b6380bc55 + .quad 0xc08ff1c096a2c6e8, 0xbd45d54159d51094 + .quad 0xc08ff1c406ab0d58, 0x3d59f684ffbca44d + .quad 0xc08ff1c775ad8428, 0x3d543b1b1d508399 + .quad 0xc08ff1cae3aac6f8, 0x3d5c30953a12fc6e + .quad 0xc08ff1ce50a370d0, 0xbd1763b04f9aad5f + .quad 0xc08ff1d1bc981c40, 0x3d573c6fa54f46c2 + .quad 0xc08ff1d527896338, 0x3d48ccfb9ffd7455 + .quad 0xc08ff1d89177df30, 0x3d42756f80d6f7ce + .quad 0xc08ff1dbfa642910, 0xbd3c2bfbc353c5a5 + .quad 0xc08ff1df624ed940, 0x3d1d6064f5dc380b + .quad 0xc08ff1e2c9388798, 0x3ce327c6b30711cf + .quad 0xc08ff1e62f21cb70, 0x3d140aa9546525bc + .quad 0xc08ff1e9940b3b98, 0xbd15c1ff43c21863 + .quad 0xc08ff1ecf7f56e60, 0x3d590ba680120498 + .quad 0xc08ff1f05ae0f988, 0x3d5390c6b62dff50 + .quad 0xc08ff1f3bcce7258, 0x3d4da0c90878457f + .quad 0xc08ff1f71dbe6d90, 0x3d30697edc85b98c + .quad 0xc08ff1fa7db17f70, 0x3d04d81188510a79 + .quad 0xc08ff1fddca83bb0, 0xbd5f2ddc983ce25c + .quad 0xc08ff2013aa33598, 0x3d46c22f0fae6844 + .quad 0xc08ff20497a2ffd0, 0xbd53359b714c3d03 + .quad 0xc08ff207f3a82ca0, 0xbd4aefaa5524f88b + .quad 0xc08ff20b4eb34dc0, 0x3d39bf4a4a73d01d + .quad 0xc08ff20ea8c4f468, 0x3d44217befdb12e6 + .quad 0xc08ff21201ddb158, 0x3d5219b281d4b6f8 + .quad 0xc08ff21559fe14c8, 0xbd5e3b123373d370 + .quad 0xc08ff218b126ae88, 
0xbd59b525a6edc3cb + .quad 0xc08ff21c07580dd8, 0xbd4b494e7737c4dc + .quad 0xc08ff21f5c92c180, 0xbd3989b7d67e3e54 + .quad 0xc08ff222b0d757d0, 0x3d486c8f098ad3cf + .quad 0xc08ff22604265e98, 0x3d5254956d8e15b2 + .quad 0xc08ff22956806330, 0x3d3f14730a362959 + .quad 0xc08ff22ca7e5f278, 0xbd40e8ed02e32ea1 + .quad 0xc08ff22ff85798d8, 0xbd40fb2b9b1e0261 + .quad 0xc08ff23347d5e238, 0xbd5bfeb1e13c8bc3 + .quad 0xc08ff23696615a18, 0x3d5b891f041e037b + .quad 0xc08ff239e3fa8b60, 0xbd36255027582bb9 + .quad 0xc08ff23d30a200a8, 0x3d56bb5a92a55361 + .quad 0xc08ff2407c5843f0, 0xbd31902fb4417244 + .quad 0xc08ff243c71dded8, 0xbd5a8a7c3c4a2cc6 + .quad 0xc08ff24710f35a88, 0xbd23be1be6941016 + .quad 0xc08ff24a59d93fa8, 0x3d55c85afafa1d46 + .quad 0xc08ff24da1d01668, 0xbd5b4b05a0adcbf1 + .quad 0xc08ff250e8d866a0, 0x3d134d191476f74b + .quad 0xc08ff2542ef2b798, 0x3d5e78ce963395e1 + .quad 0xc08ff257741f9028, 0x3d3f9219a8f57c17 + .quad 0xc08ff25ab85f76c8, 0x3d5cfc6f47ac691b + .quad 0xc08ff25dfbb2f168, 0x3d4ab3b720b5ca71 + .quad 0xc08ff2613e1a8598, 0x3d54a4ab99feb71a + .quad 0xc08ff2647f96b868, 0xbd42daa69d79d724 + .quad 0xc08ff267c0280e88, 0xbd344d9115018f45 + .quad 0xc08ff26affcf0c28, 0xbd56673e143d2ac0 + .quad 0xc08ff26e3e8c3518, 0x3d3aac889e91c638 + .quad 0xc08ff2717c600ca8, 0x3d4cf65b41d006e7 + .quad 0xc08ff274b94b15c0, 0xbd4c821320391e76 + .quad 0xc08ff277f54dd2e8, 0x3d51abd6e2ddc2a1 + .quad 0xc08ff27b3068c620, 0xbd2f1bdd1264e703 + .quad 0xc08ff27e6a9c7110, 0xbd58437b4f032f15 + .quad 0xc08ff281a3e954f0, 0xbd4f8e063b069a7d + .quad 0xc08ff284dc4ff288, 0x3d5276d0723a662a + .quad 0xc08ff28813d0ca28, 0xbd5731f7c6d8f6eb + .quad 0xc08ff28b4a6c5bd0, 0xbd58b587f08307ec + .quad 0xc08ff28e80232708, 0x3d57f19a7a352baf + .quad 0xc08ff291b4f5aae0, 0x3d570d99aff32790 + .quad 0xc08ff294e8e46610, 0x3d4efafaad4f59db + .quad 0xc08ff2981befd6e0, 0xbd41eb1728371564 + .quad 0xc08ff29b4e187b38, 0x3d458465b4e080d7 + .quad 0xc08ff29e7f5ed088, 0x3d46acb4a035a820 + .quad 0xc08ff2a1afc353e0, 0xbd39fc68238dd5d3 + 
.quad 0xc08ff2a4df4681f0, 0x3d526d90c6750dde + .quad 0xc08ff2a80de8d6f0, 0x3d48505c598278fd + .quad 0xc08ff2ab3baacec0, 0x3d520fece8e148e8 + .quad 0xc08ff2ae688ce4d0, 0x3d14f7bf38646243 + .quad 0xc08ff2b1948f9430, 0xbd5aa5f693a627df + .quad 0xc08ff2b4bfb35790, 0xbd4725d8e6280861 + .quad 0xc08ff2b7e9f8a930, 0x3d482e0765d44bda + .quad 0xc08ff2bb136002e8, 0xbd523d745da75cde + .quad 0xc08ff2be3be9de40, 0xbd32e50b4191ef73 + .quad 0xc08ff2c16396b448, 0xbd490856dfe073b2 + .quad 0xc08ff2c48a66fdb8, 0xbd512b526137db4d + .quad 0xc08ff2c7b05b32e8, 0x3d5bfcdc71b36585 + .quad 0xc08ff2cad573cbb8, 0xbd2c24f2afddb377 + .quad 0xc08ff2cdf9b13fc0, 0xbd5ea60d06da12f6 + .quad 0xc08ff2d11d140630, 0xbd582f2f9e256dc5 + .quad 0xc08ff2d43f9c95d0, 0xbd4411c269523864 + .quad 0xc08ff2d7614b6508, 0xbd41107eeb7e1093 + .quad 0xc08ff2da8220e9e8, 0x3d5a4aa491710eda + .quad 0xc08ff2dda21d9a10, 0x3d46e50a14550378 + .quad 0xc08ff2e0c141ead0, 0xbd4881e3bd846de9 + .quad 0xc08ff2e3df8e5118, 0xbd46d93437bd399d + .quad 0xc08ff2e6fd034170, 0xbd5b4ef1e9713a4c + .quad 0xc08ff2ea19a13010, 0x3d4a0e31ed25b3ef + .quad 0xc08ff2ed356890b8, 0xbd5a7a560db90113 + .quad 0xc08ff2f05059d6f0, 0x3d51f5bb5f9072c9 + .quad 0xc08ff2f36a7575c0, 0x3d5ed5225350a585 + .quad 0xc08ff2f683bbdfe0, 0xbd1c9363d9e745db + .quad 0xc08ff2f99c2d87b8, 0x3d329c788e376e0d + .quad 0xc08ff2fcb3cadf40, 0xbd59eb5d29918de0 + .quad 0xc08ff2ffca945828, 0xbd4a86aac097a06b + .quad 0xc08ff302e08a63b8, 0x3d541c2c97e8b4d1 + .quad 0xc08ff305f5ad72d8, 0x3d43c95dec31821b + .quad 0xc08ff30909fdf620, 0xbd590abed3d72738 + .quad 0xc08ff30c1d7c5dd8, 0x3d4caefdad90e913 + .quad 0xc08ff30f302919d0, 0xbd4f7ed5e1dcb170 + .quad 0xc08ff312420499a0, 0x3d3c590edf8c3407 + .quad 0xc08ff315530f4c70, 0x3d5477d46ce838e1 + .quad 0xc08ff3186349a118, 0x3d5e4b00c511fa78 + .quad 0xc08ff31b72b40610, 0xbd54333e5a0c1658 + .quad 0xc08ff31e814ee990, 0x3d25300b88bfa10a + .quad 0xc08ff3218f1ab958, 0xbd5bfbd520249ed7 + .quad 0xc08ff3249c17e2f0, 0x3d436b1cdba645b7 + .quad 0xc08ff327a846d368, 
0xbd5cb667c2f86eaa + .quad 0xc08ff32ab3a7f7a0, 0x3d5334d06a920d5f + .quad 0xc08ff32dbe3bbbf8, 0xbd5407602ab64243 + .quad 0xc08ff330c8028ca0, 0xbd52b12c9cc82316 + .quad 0xc08ff333d0fcd560, 0x3d158d7dd801324b + .quad 0xc08ff336d92b01a8, 0xbd38b55deae69564 + .quad 0xc08ff339e08d7ca0, 0x3d4a92d51dc43d43 + .quad 0xc08ff33ce724b110, 0x3d5455afbb5de008 + .quad 0xc08ff33fecf10970, 0x3d3b65694b6f87fb + .quad 0xc08ff342f1f2efe8, 0xbd3afb8ccc1260eb + .quad 0xc08ff345f62ace50, 0x3d59c98f7ec71b79 + .quad 0xc08ff348f9990e18, 0xbd5238294ff3846d + .quad 0xc08ff34bfc3e1880, 0x3d4deba7087bbf7b + .quad 0xc08ff34efe1a5650, 0xbd573e25d2d308e5 + .quad 0xc08ff351ff2e3020, 0xbd44bc302ffa76fb + .quad 0xc08ff354ff7a0e20, 0xbd2cad65891df000 + .quad 0xc08ff357fefe5838, 0x3d4b4fe326c05a8a + .quad 0xc08ff35afdbb75f8, 0x3d0fb5680f67649b + .quad 0xc08ff35dfbb1cea8, 0xbd4af509a9977e57 + .quad 0xc08ff360f8e1c940, 0x3cea69221cfb0ad6 + .quad 0xc08ff363f54bcc60, 0x3d3d116c159fead5 + .quad 0xc08ff366f0f03e58, 0xbd5e64e8bff70d5e + .quad 0xc08ff369ebcf8538, 0xbd5cc32ce5effb96 + .quad 0xc08ff36ce5ea06b8, 0x3d57bbe811e4fbda + .quad 0xc08ff36fdf402830, 0xbcf46d4595033678 + .quad 0xc08ff372d7d24ec8, 0x3d4c4bbec857b9fc + .quad 0xc08ff375cfa0df40, 0xbd59d3f339613a2d + .quad 0xc08ff378c6ac3e28, 0x3d58408e1bcb4e24 + .quad 0xc08ff37bbcf4cfa0, 0x3d5fdb793dc8e643 + .quad 0xc08ff37eb27af788, 0xbd5f0d884b401f1e + .quad 0xc08ff381a73f1988, 0xbd5a7ed37e2c50b4 + .quad 0xc08ff3849b4198e8, 0x3d5b14c1f630b2af + .quad 0xc08ff3878e82d898, 0x3d505a9abef02aff + .quad 0xc08ff38a81033b50, 0xbd4a9bbd51a7d1c4 + .quad 0xc08ff38d72c32380, 0x3d4783623464f80e + .quad 0xc08ff39063c2f338, 0xbd0e2d78f68abcc7 + .quad 0xc08ff39354030c50, 0x3d3e604763e782cb + .quad 0xc08ff3964383d048, 0xbd4514f0840b6f59 + .quad 0xc08ff3993245a060, 0xbd5488753d6035a4 + .quad 0xc08ff39c2048dd90, 0x3d5ccc099b5ff97d + .quad 0xc08ff39f0d8de870, 0x3d454ada83325c69 + .quad 0xc08ff3a1fa152168, 0x3d1e4b27fb754eb1 + .quad 0xc08ff3a4e5dee890, 0x3d58c67819ead583 + 
.quad 0xc08ff3a7d0eb9da8, 0xbd536d02e85d644b + .quad 0xc08ff3aabb3ba048, 0x3d5f510ab9e7c184 + .quad 0xc08ff3ada4cf4f98, 0x3d557bc5b296d5f5 + .quad 0xc08ff3b08da70a90, 0xbd48893b8f7f52c9 + .quad 0xc08ff3b375c32fe8, 0x3d5ca0b69a37d601 + .quad 0xc08ff3b65d241df0, 0xbd519c57fff86872 + .quad 0xc08ff3b943ca32d8, 0x3d048da0e3a8c3c3 + .quad 0xc08ff3bc29b5cc68, 0xbd5dd05e06ec07d0 + .quad 0xc08ff3bf0ee74840, 0x3d56c52a5c8015db + .quad 0xc08ff3c1f35f0398, 0x3d54e1dba9930bed + .quad 0xc08ff3c4d71d5b78, 0x3d2c5f679a7932b7 + .quad 0xc08ff3c7ba22aca0, 0xbd3f77628aa1aed8 + .quad 0xc08ff3cd7e03ac60, 0xbd5cc8a22f1d8591 + .quad 0xc08ff3d33f04e360, 0x3d4ae09463e13f6f + .quad 0xc08ff3d8fd292dc8, 0x3d42736efbec3922 + .quad 0xc08ff3deb8736390, 0xbce0324f8d149b09 + .quad 0xc08ff3e470e65870, 0xbd52089e4b8dd900 + .quad 0xc08ff3ea2684dbf0, 0xbd5f8e9d5dea127f + .quad 0xc08ff3efd951b970, 0xbd4b60d79db026b1 + .quad 0xc08ff3f5894fb828, 0x3d45ff1d6cea2c52 + .quad 0xc08ff3fb36819b38, 0x3d5d56022cd7f5b2 + .quad 0xc08ff400e0ea21a8, 0xbd58d63f09907b27 + .quad 0xc08ff406888c0690, 0xbd4ce6ea362f7ce0 + .quad 0xc08ff40c2d6a00f0, 0x3d519fc9ad2ef3ab + .quad 0xc08ff411cf86c3c8, 0xbd55fc89e7b55f20 + .quad 0xc08ff4176ee4fe40, 0xbd53229ca791d9be + .quad 0xc08ff41d0b875b88, 0x3d5e7733e6fb23d1 + .quad 0xc08ff422a57082e0, 0x3d5871413696b637 + .quad 0xc08ff4283ca317c0, 0x3d4b118aa7f493b9 + .quad 0xc08ff42dd121b9c8, 0x3d4bdf3692763b50 + .quad 0xc08ff43362ef04c8, 0x3d4867e17476dd63 + .quad 0xc08ff438f20d90c8, 0xbd5d49b741c778f3 + .quad 0xc08ff43e7e7ff228, 0x3d59ac35724f01e3 + .quad 0xc08ff4440848b968, 0xbd5251ccdc49432d + .quad 0xc08ff4498f6a7388, 0x3d56cf153ebc9f07 + .quad 0xc08ff44f13e7a9b8, 0x3d503b7a697a659c + .quad 0xc08ff45495c2e198, 0xbd5fa03da8acd872 + .quad 0xc08ff45a14fe9d38, 0xbd5e6cfb0b5c38fc + .quad 0xc08ff45f919d5b08, 0x3d468b1f1269f1cf + .quad 0xc08ff4650ba195e0, 0xbd313a3a8f72c0f3 + .quad 0xc08ff46a830dc528, 0x3d205d31eb8d2bd4 + .quad 0xc08ff46ff7e45cb8, 0xbd56cb8ddf5d4a90 + .quad 0xc08ff4756a27cd00, 
0x3d272c2d46acdcbf + .quad 0xc08ff47ad9da82e8, 0xbd4946efab7a989d + .quad 0xc08ff48046fee800, 0xbd23fabe48cf933c + .quad 0xc08ff485b1976268, 0x3d4f03b099d80f79 + .quad 0xc08ff48b19a654e0, 0x3d4fe0c35ab7e9b5 + .quad 0xc08ff4907f2e1ed0, 0xbd54b4843f34fe09 + .quad 0xc08ff495e2311c58, 0xbd5dfa6541236a64 + .quad 0xc08ff49b42b1a648, 0x3d56fd2c8c418cbb + .quad 0xc08ff4a0a0b21218, 0x3d5e687ef208418a + .quad 0xc08ff4a5fc34b210, 0x3d4a671ce14c5521 + .quad 0xc08ff4ab553bd540, 0x3d419d0202e3cd96 + .quad 0xc08ff4b0abc9c780, 0x3d576b941a895781 + .quad 0xc08ff4b5ffe0d170, 0xbd4ea96d88cd1a30 + .quad 0xc08ff4bb518338a0, 0x3d4d6b405bd43ba6 + .quad 0xc08ff4c0a0b33f60, 0xbcf03382150a56b7 + .quad 0xc08ff4c5ed7324f8, 0xbd400df96beb0937 + .quad 0xc08ff4cb37c52590, 0xbd5c161714cdebd5 + .quad 0xc08ff4d07fab7a48, 0xbd333e8eda1a8e79 + .quad 0xc08ff4d5c5285928, 0x3d53aba20381d59f + .quad 0xc08ff4db083df530, 0xbd45e9b07af4e77c + .quad 0xc08ff4e048ee7e70, 0xbd533cfdb78a8c41 + .quad 0xc08ff4e5873c21f0, 0xbd5d9b87f4d283f2 + .quad 0xc08ff4eac32909c8, 0xbd53a677deee97fa + .quad 0xc08ff4effcb75d18, 0xbd5afd9f5dedc208 + .quad 0xc08ff4f533e94020, 0x3ce9dd794d20ab77 + .quad 0xc08ff4fa68c0d428, 0xbd5eeae84ba1cbf1 + .quad 0xc08ff4ff9b4037b0, 0xbd4f4451587282c8 + .quad 0xc08ff504cb698648, 0xbd4a1fa15087e717 + .quad 0xc08ff509f93ed8b0, 0xbd5f2f0042b9331a + .quad 0xc08ff50f24c244e0, 0xbd2c2389f8e86341 + .quad 0xc08ff5144df5ddf0, 0xbd556fcb7b48f200 + .quad 0xc08ff51974dbb448, 0x3d43ba060aa69038 + .quad 0xc08ff51e9975d578, 0x3d477ef38ca20229 + .quad 0xc08ff523bbc64c60, 0x3d49bcaf1aa4168a + .quad 0xc08ff528dbcf2120, 0xbd51c5609b60687e + .quad 0xc08ff52df9925930, 0xbd51691708d22ce7 + .quad 0xc08ff5331511f750, 0x3d30d05c98ecb3d1 + .quad 0xc08ff5382e4ffb90, 0xbd423adb056dd244 + .quad 0xc08ff53d454e6368, 0xbd3663607042da50 + .quad 0xc08ff5425a0f29a8, 0x3d42655d3c6187a6 + .quad 0xc08ff5476c944680, 0xbd028c958ae09d20 + .quad 0xc08ff54c7cdfaf90, 0xbd436eaf17756653 + .quad 0xc08ff5518af357e8, 0x3d5fbbbee66f8d24 + 
.quad 0xc08ff55696d12ff0, 0xbd5d93b389497880 + .quad 0xc08ff55ba07b25b0, 0xbd43ff8ff777f337 + .quad 0xc08ff560a7f32488, 0xbcf3568803ec82a4 + .quad 0xc08ff565ad3b1560, 0xbd50c83eba5cc7ea + .quad 0xc08ff56ab054deb0, 0x3d5becc2411500b7 + .quad 0xc08ff56fb1426458, 0xbd5dac964ffa8b83 + .quad 0xc08ff574b00587f0, 0x3d1d82f6cc82e69f + .quad 0xc08ff579aca02878, 0xbd34767c0d40542c + .quad 0xc08ff57ea7142298, 0xbd52d28e996ed2ce + .quad 0xc08ff5839f635090, 0xbd432a85d337086d + .quad 0xc08ff588958f8a38, 0x3d512b06ec20c7fd + .quad 0xc08ff58d899aa500, 0xbd47e2147555e10b + .quad 0xc08ff5927b867410, 0xbd4d84480a1b301d + .quad 0xc08ff5976b54c830, 0x3d5622146f3a51bd + .quad 0xc08ff59c59076fc8, 0x3d46d485c5f9c392 + .quad 0xc08ff5a144a03700, 0xbd4562714549f4fd + .quad 0xc08ff5a62e20e7b8, 0x3d541ab67e365a63 + .quad 0xc08ff5ab158b4970, 0xbd5b0855668b2369 + .quad 0xc08ff5affae12188, 0x3d27de1bc2ed4dd8 + .quad 0xc08ff5b4de243300, 0x3d40f2592d5ed454 + .quad 0xc08ff5b9bf563ea8, 0xbd4ee2f8ba7b3e9e + .quad 0xc08ff5be9e790320, 0xbd3c2214335c2164 + .quad 0xc08ff5c37b8e3cc8, 0x3d30745623ab1fd9 + .quad 0xc08ff5c85697a5d0, 0xbd326c8fb0ffde38 + .quad 0xc08ff5cd2f96f640, 0xbd4c83277493b0bc + .quad 0xc08ff5d2068de3f8, 0x3d39bb1655e6e5ba + .quad 0xc08ff5d6db7e22a8, 0x3d403170b47a5559 + .quad 0xc08ff5dbae6963e8, 0x3d5801ddf1edc325 + .quad 0xc08ff5e07f515728, 0x3d4b2704c46fe064 + .quad 0xc08ff5e54e37a9c8, 0x3d5a16e99ed6cd83 + .quad 0xc08ff5ea1b1e0700, 0xbd5353a3ac18c62f + .quad 0xc08ff5eee6061810, 0x3d567c69c189f21a + .quad 0xc08ff5f3aef18400, 0xbd50dd3220e0b0f2 + .quad 0xc08ff5f875e1eff0, 0xbd3ab64d80638db2 + .quad 0xc08ff5fd3ad8fee0, 0x3d3ec753439035aa + .quad 0xc08ff601fdd851c8, 0xbd5e10415f5f5e74 + .quad 0xc08ff606bee187b0, 0xbd55f1048b113fae + .quad 0xc08ff60b7df63d90, 0x3d1e94e4107406c8 + .quad 0xc08ff6103b180e60, 0xbd4e2eb5d0c36eb5 + .quad 0xc08ff614f6489330, 0x3d43ec5c714f709a + .quad 0xc08ff619af896308, 0x3d519ec459b62a08 + .quad 0xc08ff61e66dc1300, 0xbd5b93d09dd6161d + .quad 0xc08ff6231c423658, 
0x3d5d72b849dd56be + .quad 0xc08ff627cfbd5e38, 0xbd276b7e32659173 + .quad 0xc08ff62c814f1a08, 0x3d4fd918f2e7a6b9 + .quad 0xc08ff63130f8f730, 0x3d5609ba1dcc4c97 + .quad 0xc08ff635debc8138, 0xbd55cab233dbd84c + .quad 0xc08ff63a8a9b41d8, 0xbd56778ab7aaabc9 + .quad 0xc08ff63f3496c0e0, 0x3d5b2791da49c370 + .quad 0xc08ff643dcb08438, 0x3d583063ef145f9c + .quad 0xc08ff64882ea1000, 0xbd484e9cab375fb6 + .quad 0xc08ff64d2744e688, 0xbd5c430c95c374aa + .quad 0xc08ff651c9c28848, 0xbd57a16d78490bb3 + .quad 0xc08ff6566a6473e8, 0xbd445d70374ea9ec + .quad 0xc08ff65b092c2648, 0x3d5c9729142b9d4b + .quad 0xc08ff65fa61b1a70, 0xbd4aaa179d032405 + .quad 0xc08ff6644132c9c0, 0xbd2a3ea300d173de + .quad 0xc08ff668da74abc0, 0x3d57809438efb010 + .quad 0xc08ff66d71e23630, 0xbd5e9156720951d6 + .quad 0xc08ff672077cdd30, 0xbd5bab62e8462035 + .quad 0xc08ff6769b461310, 0xbd05113545431443 + .quad 0xc08ff67b2d3f4868, 0x3d5105eb0607e59b + .quad 0xc08ff67fbd69ec18, 0xbd5e657842b37dc0 + .quad 0xc08ff6844bc76b68, 0x3d4ad1849705bc4c + .quad 0xc08ff688d85931c8, 0xbd508b6f92b6e0d6 + .quad 0xc08ff68d6320a920, 0x3d48683cceb5fdfc + .quad 0xc08ff691ec1f3990, 0xbd2c25ee290acbf5 + .quad 0xc08ff696735649a8, 0x3d58904932cd46d0 + .quad 0xc08ff69af8c73e38, 0xbd5c964167f0bfeb + .quad 0xc08ff69f7c737a90, 0xbd43d66937fa06a9 + .quad 0xc08ff6a3fe5c6040, 0xbd54bc302ffa76fb + .quad 0xc08ff6a87e834f50, 0x3d4609b1487f87a3 + .quad 0xc08ff6acfce9a618, 0xbd42c0d9af0400b1 + .quad 0xc08ff6b17990c170, 0x3d549a63973d262d + .quad 0xc08ff6b5f479fc80, 0xbd28cde894aa0641 + .quad 0xc08ff6ba6da6b0f0, 0xbd5acef617609a34 + .quad 0xc08ff6bee51836d8, 0x3d4abb9ff3cf80b8 + .quad 0xc08ff6c35acfe4a8, 0xbd53dcfa1b7697f3 + .quad 0xc08ff6c7cecf0f68, 0x3d5bcdf4aea18a55 + .quad 0xc08ff6cc41170a70, 0x3d3cad29d4324038 + .quad 0xc08ff6d0b1a927b0, 0x3d56945f9cc2a565 + .quad 0xc08ff6d52086b780, 0x3d5d20dfc1c668a7 + .quad 0xc08ff6d98db108b8, 0x3d37f20a9bcbbe04 + .quad 0xc08ff6ddf92968b8, 0x3d1e0824a6e3a4d2 + .quad 0xc08ff6e262f12358, 0xbd469f07bf6322c7 + 
.quad 0xc08ff6e6cb0982f8, 0xbd5cc593afdbfaef + .quad 0xc08ff6eb3173d080, 0xbd5ee68d555d7122 + .quad 0xc08ff6ef96315360, 0xbd144ee1d6a39124 + .quad 0xc08ff6f3f9435188, 0xbd40f2cb308bcd25 + .quad 0xc08ff6f85aab0f80, 0xbd5fd98ced08a73c + .quad 0xc08ff6fcba69d068, 0x3d54f2f2a1ea8606 + .quad 0xc08ff7011880d5d0, 0xbd57818234572db7 + .quad 0xc08ff70574f16008, 0x3d52429e823a9a83 + .quad 0xc08ff709cfbcadd0, 0x3d5d6dc9bb81476c + .quad 0xc08ff70e28e3fc90, 0x3d57d189e116bcb2 + .quad 0xc08ff71280688848, 0x3d0e18992809fd6d + .quad 0xc08ff716d64b8b98, 0xbd3b48ac92b8549a + .quad 0xc08ff71b2a8e3fb8, 0xbd4dcfa48040893b + .quad 0xc08ff71f7d31dc88, 0x3d58d945b8e53ef1 + .quad 0xc08ff723ce379878, 0x3d4f80faef3e15ee + .quad 0xc08ff7281da0a8b0, 0x3d53edc0fd40d18f + .quad 0xc08ff72c6b6e40f0, 0xbd4bcac66e0be72f + .quad 0xc08ff730b7a193b0, 0xbd44fcf96e2ec967 + .quad 0xc08ff735023bd208, 0x3d57e2ff34b08d86 + .quad 0xc08ff7394b3e2bb0, 0xbd4caedfb10b98dd + .quad 0xc08ff73d92a9cf28, 0xbd55db1083e5ac6a + .quad 0xc08ff741d87fe990, 0xbd580e83e6d54ed6 + .quad 0xc08ff7461cc1a6c0, 0x3d1688c83e1b0cba + .quad 0xc08ff74a5f703138, 0xbd52c398c872b701 + .quad 0xc08ff74ea08cb240, 0xbd49aabc3683b259 + .quad 0xc08ff752e01851d0, 0x3d5ccba8de72495b + .quad 0xc08ff7571e143688, 0xbd5981cf630f5793 + .quad 0xc08ff75b5a8185e8, 0xbd4f235844e01ebd + .quad 0xc08ff75f95616410, 0xbd5047de7ba8ec62 + .quad 0xc08ff763ceb4f3f0, 0x3d5fa55e004d6562 + .quad 0xc08ff768067d5720, 0xbd49f386e521a80e + .quad 0xc08ff76c3cbbae20, 0x3d3693551e62fe83 + .quad 0xc08ff77071711818, 0x3d4ba63b30b6c42c + .quad 0xc08ff774a49eb300, 0x3d4c26523d32f573 + .quad 0xc08ff778d6459b98, 0x3d3b65e70806143a + .quad 0xc08ff77d0666ed68, 0xbd5796d9c9f2c2cb + .quad 0xc08ff7813503c2d0, 0x3d33267b004b912b + .quad 0xc08ff785621d34e8, 0x3d1d5d8a23e33341 + .quad 0xc08ff7898db45ba8, 0x3d46c95233e60f40 + .quad 0xc08ff78db7ca4dd0, 0x3d362865acc8f43f + .quad 0xc08ff791e06020f8, 0xbd10e8203e161511 + .quad 0xc08ff7960776e988, 0xbd5cafe4f4467eaa + .quad 0xc08ff79a2d0fbac8, 
0xbd520fddea9ea0cd + .quad 0xc08ff79e512ba6d0, 0x3d5c53d3778dae46 + .quad 0xc08ff7a273cbbe80, 0xbd5f0f6f88490367 + .quad 0xc08ff7a694f111c0, 0x3d5601aa3f55ec11 + .quad 0xc08ff7aab49caf20, 0xbd4f1a8a2328a4c4 + .quad 0xc08ff7aed2cfa438, 0xbd4a3d5341c07d0e + .quad 0xc08ff7b2ef8afd68, 0xbd5f4a1f4c525f31 + .quad 0xc08ff7b70acfc600, 0xbd4d594d77b3d775 + .quad 0xc08ff7bb249f0828, 0x3d2aef47e37e953b + .quad 0xc08ff7bf3cf9ccf0, 0x3d501803b47dfba2 + .quad 0xc08ff7c353e11c50, 0x3d5ed5ec84e5745e + .quad 0xc08ff7c76955fd20, 0xbd3de249bc9e7f96 + .quad 0xc08ff7cb7d597538, 0x3d5b5794341d1fdf + .quad 0xc08ff7cf8fec8938, 0xbd519dbd08276359 + .quad 0xc08ff7d3a1103cd0, 0xbd450129b8038848 + .quad 0xc08ff7d7b0c59288, 0x3d348f00d3bb30fd + .quad 0xc08ff7dbbf0d8bd8, 0xbd43529025720d8a + .quad 0xc08ff7dfcbe92938, 0x3d5abdaa2b1955d7 + .quad 0xc08ff7e3d75969f8, 0xbd4e8837d4588a98 + .quad 0xc08ff7e7e15f4c80, 0x3d57a782a6df5a1f + .quad 0xc08ff7ebe9fbce08, 0x3d304ba3eaa96bf1 + .quad 0xc08ff7eff12fead8, 0xbd47aab17b868a60 + .quad 0xc08ff7f3f6fc9e28, 0xbd5bd858693ba90a + .quad 0xc08ff7f7fb62e230, 0x3d26abb2c547789a + .quad 0xc08ff7fbfe63b010, 0xbd59d383d543b3f5 + .quad 0xc08ff80000000000, 0x8000000000000000 + /*== Log_LA_table ==*/ + .align 16 + .quad 0x0000000000000000 + .quad 0xbf670f83ff0a7565 + .quad 0xbf7709c46d7aac77 + .quad 0xbf8143068125dd0e + .quad 0xbf86fe50b6ef0851 + .quad 0xbf8cb6c3abd14559 + .quad 0xbf91363117a97b0c + .quad 0xbf940f9786685d29 + .quad 0xbf96e79685c2d22a + .quad 0xbf99be2f7749acc2 + .quad 0xbf9c9363ba850f86 + .quad 0xbf9f6734acf8695a + .quad 0xbfa11cd1d5133413 + .quad 0xbfa2855905ca70f6 + .quad 0xbfa3ed3094685a26 + .quad 0xbfa554592bb8cd58 + .quad 0xbfa6bad3758efd87 + .quad 0xbfa820a01ac754cb + .quad 0xbfa985bfc3495194 + .quad 0xbfaaea3316095f72 + .quad 0xbfac4dfab90aab5f + .quad 0xbfadb1175160f3b0 + .quad 0xbfaf1389833253a0 + .quad 0xbfb03aa8f8dc854c + .quad 0xbfb0eb389fa29f9b + .quad 0xbfb19b74069f5f0a + .quad 0xbfb24b5b7e135a3d + .quad 0xbfb2faef55ccb372 + .quad 
0xbfb3aa2fdd27f1c3 + .quad 0xbfb4591d6310d85a + .quad 0xbfb507b836033bb7 + .quad 0xbfb5b600a40bd4f3 + .quad 0xbfb663f6fac91316 + .quad 0xbfb7119b876bea86 + .quad 0xbfb7beee96b8a281 + .quad 0xbfb86bf07507a0c7 + .quad 0xbfb918a16e46335b + .quad 0xbfb9c501cdf75872 + .quad 0xbfba7111df348494 + .quad 0xbfbb1cd1ecae66e7 + .quad 0xbfbbc84240adabba + .quad 0xbfbc73632513bd4f + .quad 0xbfbd1e34e35b82da + .quad 0xbfbdc8b7c49a1ddb + .quad 0xbfbe72ec117fa5b2 + .quad 0xbfbf1cd21257e18c + .quad 0xbfbfc66a0f0b00a5 + .quad 0xbfc037da278f2870 + .quad 0xbfc08c588cda79e4 + .quad 0xbfc0e0b05ac848ed + .quad 0xbfc134e1b489062e + .quad 0xbfc188ecbd1d16be + .quad 0xbfc1dcd197552b7b + .quad 0xbfc2309065d29791 + .quad 0xbfc284294b07a640 + .quad 0xbfc2d79c6937efdd + .quad 0xbfc32ae9e278ae1a + .quad 0xbfc37e11d8b10f89 + .quad 0xbfc3d1146d9a8a64 + .quad 0xbfc423f1c2c12ea2 + .quad 0xbfc476a9f983f74d + .quad 0xbfc4c93d33151b24 + .quad 0xbfc51bab907a5c8a + .quad 0xbfc56df5328d58c5 + .quad 0xbfc5c01a39fbd688 + .quad 0xbfc6121ac74813cf + .quad 0xbfc663f6fac91316 + .quad 0xbfc6b5aef4aae7dc + .quad 0xbfc70742d4ef027f + .quad 0xbfc758b2bb6c7b76 + .quad 0xbfc7a9fec7d05ddf + .quad 0xbfc7fb27199df16d + .quad 0xbfc84c2bd02f03b3 + .quad 0xbfc89d0d0ab430cd + .quad 0xbfc8edcae8352b6c + .quad 0xbfc93e6587910444 + .quad 0xbfc98edd077e70df + .quad 0xbfc9df31868c11d5 + .quad 0xbfca2f632320b86b + .quad 0xbfca7f71fb7bab9d + .quad 0xbfcacf5e2db4ec94 + .quad 0xbfcb1f27d7bd7a80 + .quad 0xbfcb6ecf175f95e9 + .quad 0xbfcbbe540a3f036f + .quad 0xbfcc0db6cdd94dee + .quad 0xbfcc5cf77f860826 + .quad 0xbfccac163c770dc9 + .quad 0xbfccfb1321b8c400 + .quad 0xbfcd49ee4c325970 + .quad 0xbfcd98a7d8a605a7 + .quad 0xbfcde73fe3b1480f + .quad 0xbfce35b689cd2655 + .quad 0xbfce840be74e6a4d + .quad 0xbfced2401865df52 + .quad 0xbfcf205339208f27 + .quad 0xbfcf6e456567fe55 + .quad 0xbfcfbc16b902680a + .quad 0xbfd004e3a7c97cbd + .quad 0xbfd02baba24d0664 + .quad 0xbfd0526359bab1b3 + .quad 0xbfd0790adbb03009 + .quad 0xbfd09fa235ba2020 + .quad 
0xbfd0c62975542a8f + .quad 0xbfd0eca0a7e91e0b + .quad 0xbfd11307dad30b76 + .quad 0xbfd1395f1b5b61a6 + .quad 0xbfd15fa676bb08ff + .quad 0xbfd185ddfa1a7ed0 + .quad 0xbfd1ac05b291f070 + .quad 0xbfd1d21dad295632 + .quad 0xbfd1f825f6d88e13 + .quad 0xbfd21e1e9c877639 + .quad 0xbfd24407ab0e073a + .quad 0xbfd269e12f346e2c + .quad 0xbfd28fab35b32683 + .quad 0xbfd2b565cb3313b6 + .quad 0xbfd2db10fc4d9aaf + .quad 0xbfd300acd58cbb10 + .quad 0xbfd32639636b2836 + .quad 0xbfd34bb6b2546218 + .quad 0xbfd37124cea4cded + .quad 0xbfd39683c4a9ce9a + .quad 0xbfd3bbd3a0a1dcfb + .quad 0xbfd3e1146ebc9ff2 + .quad 0xbfd406463b1b0449 + .quad 0xbfd42b6911cf5465 + .quad 0xbfd4507cfedd4fc4 + .quad 0xbfd475820e3a4251 + .quad 0xbfd49a784bcd1b8b + .quad 0xbfd4bf5fc36e8577 + .quad 0xbfd4e43880e8fb6a + .quad 0xbfd509028ff8e0a2 + .quad 0xbfd52dbdfc4c96b3 + .quad 0xbfd5526ad18493ce + .quad 0xbfd577091b3378cb + .quad 0xbfd59b98e4de271c + .quad 0xbfd5c01a39fbd688 + .quad 0xbfd5e48d25f62ab9 + .quad 0xbfd608f1b42948ae + .quad 0xbfd62d47efe3ebee + .quad 0xbfd6518fe4677ba7 + .quad 0xbfd675c99ce81f92 + .quad 0xbfd699f5248cd4b8 + .quad 0xbfd6be12866f820d + .quad 0xbfd6e221cd9d0cde + .quad 0xbfd7062305156d1d + .quad 0xbfd72a1637cbc183 + .quad 0xbfd74dfb70a66388 + .quad 0xbfd771d2ba7efb3c + .quad 0xbfd7959c202292f1 + .quad 0xbfd7b957ac51aac4 + .quad 0xbfd7dd0569c04bff + .quad 0xbfd800a563161c54 + .quad 0xbfd82437a2ee70f7 + .quad 0xbfd847bc33d8618e + .quad 0xbfd86b332056db01 + .quad 0xbfd88e9c72e0b226 + .quad 0xbfd8b1f835e0b642 + .quad 0xbfd8d54673b5c372 + .quad 0xbfd8f88736b2d4e8 + .quad 0xbfd91bba891f1709 + .quad 0xbfd93ee07535f967 + .quad 0xbfd961f90527409c + .quad 0xbfd98504431717fc + .quad 0xbfd9a802391e232f + .quad 0xbfd9caf2f1498fa4 + .quad 0xbfd9edd6759b25e0 + .quad 0xbfda10acd0095ab4 + .quad 0xbfda33760a7f6051 + .quad 0xbfda56322edd3731 + .quad 0xbfda78e146f7bef4 + .quad 0xbfda9b835c98c70a + .quad 0xbfdabe18797f1f49 + .quad 0xbfdae0a0a75ea862 + .quad 0xbfdb031befe06434 + .quad 0xbfdb258a5ca28608 + .quad 
0xbfdb47ebf73882a1 + .quad 0xbfdb6a40c92b203f + .quad 0xbfdb8c88dbf8867a + .quad 0xbfdbaec439144dfd + .quad 0xbfdbd0f2e9e79031 + .quad 0xbfdbf314f7d0f6ba + .quad 0xbfdc152a6c24cae6 + .quad 0xbfdc3733502d04f8 + .quad 0xbfdc592fad295b56 + .quad 0xbfdc7b1f8c4f51a4 + .quad 0xbfdc9d02f6ca47b4 + .quad 0xbfdcbed9f5bb886a + .quad 0xbfdce0a4923a587d + .quad 0xbfdd0262d554051c + .quad 0xbfdd2414c80bf27d + .quad 0xbfdd45ba735baa4f + .quad 0xbfdd6753e032ea0f + .quad 0xbfdd88e11777b149 + .quad 0xbfddaa6222064fb9 + .quad 0xbfddcbd708b17359 + .quad 0xbfdded3fd442364c + .quad 0xbfde0e9c8d782cbd + .quad 0xbfde2fed3d097298 + .quad 0xbfde5131eba2b931 + .quad 0xbfde726aa1e754d2 + .quad 0xbfde939768714a32 + .quad 0xbfdeb4b847d15bce + .quad 0xbfded5cd488f1732 + .quad 0xbfdef6d67328e220 + .quad 0xbfdf17d3d01407af + .quad 0xbfdf38c567bcc541 + .quad 0xbfdf59ab4286576c + .quad 0xbfdf7a8568cb06cf + .quad 0xbfdf9b53e2dc34c4 + .quad 0xbfdfbc16b902680a + .quad 0xbfdfdccdf37d594c + .quad 0xbfdffd799a83ff9b + .quad 0x3fdfe1e649bb6335 + .quad 0x3fdfc151b11b3640 + .quad 0x3fdfa0c8937e7d5d + .quad 0x3fdf804ae8d0cd02 + .quad 0x3fdf5fd8a9063e35 + .quad 0x3fdf3f71cc1b629c + .quad 0x3fdf1f164a15389a + .quad 0x3fdefec61b011f85 + .quad 0x3fdede8136f4cbf1 + .quad 0x3fdebe47960e3c08 + .quad 0x3fde9e193073ac06 + .quad 0x3fde7df5fe538ab3 + .quad 0x3fde5dddf7e46e0a + .quad 0x3fde3dd1156507de + .quad 0x3fde1dcf4f1c1a9e + .quad 0x3fddfdd89d586e2b + .quad 0x3fddddecf870c4c1 + .quad 0x3fddbe0c58c3cff2 + .quad 0x3fdd9e36b6b825b1 + .quad 0x3fdd7e6c0abc3579 + .quad 0x3fdd5eac4d463d7e + .quad 0x3fdd3ef776d43ff4 + .quad 0x3fdd1f4d7febf868 + .quad 0x3fdcffae611ad12b + .quad 0x3fdce01a12f5d8d1 + .quad 0x3fdcc0908e19b7bd + .quad 0x3fdca111cb2aa5c5 + .quad 0x3fdc819dc2d45fe4 + .quad 0x3fdc62346dca1dfe + .quad 0x3fdc42d5c4c688b4 + .quad 0x3fdc2381c08baf4f + .quad 0x3fdc043859e2fdb3 + .quad 0x3fdbe4f9899d326e + .quad 0x3fdbc5c5489254cc + .quad 0x3fdba69b8fa1ab02 + .quad 0x3fdb877c57b1b070 + .quad 0x3fdb686799b00be3 + .quad 
0x3fdb495d4e9185f7 + .quad 0x3fdb2a5d6f51ff83 + .quad 0x3fdb0b67f4f46810 + .quad 0x3fdaec7cd882b46c + .quad 0x3fdacd9c130dd53f + .quad 0x3fdaaec59dadadbe + .quad 0x3fda8ff971810a5e + .quad 0x3fda713787ad97a5 + .quad 0x3fda527fd95fd8ff + .quad 0x3fda33d25fcb1fac + .quad 0x3fda152f142981b4 + .quad 0x3fd9f695efbbd0ef + .quad 0x3fd9d806ebc9921c + .quad 0x3fd9b98201a0f405 + .quad 0x3fd99b072a96c6b2 + .quad 0x3fd97c96600672ad + .quad 0x3fd95e2f9b51f04e + .quad 0x3fd93fd2d5e1bf1d + .quad 0x3fd921800924dd3b + .quad 0x3fd903372e90bee4 + .quad 0x3fd8e4f83fa145ee + .quad 0x3fd8c6c335d8b966 + .quad 0x3fd8a8980abfbd32 + .quad 0x3fd88a76b7e549c6 + .quad 0x3fd86c5f36dea3dc + .quad 0x3fd84e5181475449 + .quad 0x3fd8304d90c11fd3 + .quad 0x3fd812535ef3ff19 + .quad 0x3fd7f462e58e1688 + .quad 0x3fd7d67c1e43ae5c + .quad 0x3fd7b89f02cf2aad + .quad 0x3fd79acb8cf10390 + .quad 0x3fd77d01b66fbd37 + .quad 0x3fd75f417917e02c + .quad 0x3fd7418acebbf18f + .quad 0x3fd723ddb1346b65 + .quad 0x3fd7063a1a5fb4f2 + .quad 0x3fd6e8a004221b1f + .quad 0x3fd6cb0f6865c8ea + .quad 0x3fd6ad88411abfea + .quad 0x3fd6900a8836d0d5 + .quad 0x3fd6729637b59418 + .quad 0x3fd6552b49986277 + .quad 0x3fd637c9b7e64dc2 + .quad 0x3fd61a717cac1983 + .quad 0x3fd5fd2291fc33cf + .quad 0x3fd5dfdcf1eeae0e + .quad 0x3fd5c2a096a135dc + .quad 0x3fd5a56d7a370ded + .quad 0x3fd5884396d90702 + .quad 0x3fd56b22e6b578e5 + .quad 0x3fd54e0b64003b70 + .quad 0x3fd530fd08f29fa7 + .quad 0x3fd513f7cfcb68ce + .quad 0x3fd4f6fbb2cec598 + .quad 0x3fd4da08ac46495a + .quad 0x3fd4bd1eb680e548 + .quad 0x3fd4a03dcbd2e1be + .quad 0x3fd48365e695d797 + .quad 0x3fd466970128a987 + .quad 0x3fd449d115ef7d87 + .quad 0x3fd42d141f53b646 + .quad 0x3fd4106017c3eca3 + .quad 0x3fd3f3b4f9b3e939 + .quad 0x3fd3d712bf9c9def + .quad 0x3fd3ba7963fc1f8f + .quad 0x3fd39de8e1559f6f + .quad 0x3fd3816132316520 + .quad 0x3fd364e2511cc821 + .quad 0x3fd3486c38aa29a8 + .quad 0x3fd32bfee370ee68 + .quad 0x3fd30f9a4c0d786d + .quad 0x3fd2f33e6d2120f2 + .quad 0x3fd2d6eb4152324f + .quad 
0x3fd2baa0c34be1ec + .quad 0x3fd29e5eedbe4a35 + .quad 0x3fd28225bb5e64a4 + .quad 0x3fd265f526e603cb + .quad 0x3fd249cd2b13cd6c + .quad 0x3fd22dadc2ab3497 + .quad 0x3fd21196e87473d1 + .quad 0x3fd1f588973c8747 + .quad 0x3fd1d982c9d52708 + .quad 0x3fd1bd857b14c146 + .quad 0x3fd1a190a5d674a0 + .quad 0x3fd185a444fa0a7b + .quad 0x3fd169c05363f158 + .quad 0x3fd14de4cbfd373e + .quad 0x3fd13211a9b38424 + .quad 0x3fd11646e7791469 + .quad 0x3fd0fa848044b351 + .quad 0x3fd0deca6f11b58b + .quad 0x3fd0c318aedff3c0 + .quad 0x3fd0a76f3ab3c52c + .quad 0x3fd08bce0d95fa38 + .quad 0x3fd070352293d724 + .quad 0x3fd054a474bf0eb7 + .quad 0x3fd0391bff2dbcf3 + .quad 0x3fd01d9bbcfa61d4 + .quad 0x3fd00223a943dc19 + .quad 0x3fcfcd677e5ac81d + .quad 0x3fcf9697f3bd0ccf + .quad 0x3fcf5fd8a9063e35 + .quad 0x3fcf29299496a889 + .quad 0x3fcef28aacd72231 + .quad 0x3fcebbfbe83901a6 + .quad 0x3fce857d3d361368 + .quad 0x3fce4f0ea2509008 + .quad 0x3fce18b00e13123d + .quad 0x3fcde26177108d03 + .quad 0x3fcdac22d3e441d3 + .quad 0x3fcd75f41b31b6dd + .quad 0x3fcd3fd543a4ad5c + .quad 0x3fcd09c643f117f0 + .quad 0x3fccd3c712d31109 + .quad 0x3fcc9dd7a70ed160 + .quad 0x3fcc67f7f770a67e + .quad 0x3fcc3227facce950 + .quad 0x3fcbfc67a7fff4cc + .quad 0x3fcbc6b6f5ee1c9b + .quad 0x3fcb9115db83a3dd + .quad 0x3fcb5b844fb4b3ef + .quad 0x3fcb2602497d5346 + .quad 0x3fcaf08fbfe15c51 + .quad 0x3fcabb2ca9ec7472 + .quad 0x3fca85d8feb202f7 + .quad 0x3fca5094b54d2828 + .quad 0x3fca1b5fc4e0b465 + .quad 0x3fc9e63a24971f46 + .quad 0x3fc9b123cba27ed3 + .quad 0x3fc97c1cb13c7ec1 + .quad 0x3fc94724cca657be + .quad 0x3fc9123c1528c6ce + .quad 0x3fc8dd62821404a9 + .quad 0x3fc8a8980abfbd32 + .quad 0x3fc873dca68b06f4 + .quad 0x3fc83f304cdc5aa7 + .quad 0x3fc80a92f5218acc + .quad 0x3fc7d60496cfbb4c + .quad 0x3fc7a18529635926 + .quad 0x3fc76d14a4601225 + .quad 0x3fc738b2ff50ccad + .quad 0x3fc7046031c79f85 + .quad 0x3fc6d01c335dc9b5 + .quad 0x3fc69be6fbb3aa6f + .quad 0x3fc667c08270b905 + .quad 0x3fc633a8bf437ce1 + .quad 0x3fc5ff9fa9e18595 + .quad 
0x3fc5cba53a0762ed + .quad 0x3fc597b967789d12 + .quad 0x3fc563dc29ffacb2 + .quad 0x3fc5300d796df33a + .quad 0x3fc4fc4d4d9bb313 + .quad 0x3fc4c89b9e6807f5 + .quad 0x3fc494f863b8df35 + .quad 0x3fc46163957af02e + .quad 0x3fc42ddd2ba1b4a9 + .quad 0x3fc3fa651e276158 + .quad 0x3fc3c6fb650cde51 + .quad 0x3fc3939ff859bf9f + .quad 0x3fc36052d01c3dd7 + .quad 0x3fc32d13e4692eb7 + .quad 0x3fc2f9e32d5bfdd1 + .quad 0x3fc2c6c0a316a540 + .quad 0x3fc293ac3dc1a668 + .quad 0x3fc260a5f58c02bd + .quad 0x3fc22dadc2ab3497 + .quad 0x3fc1fac39d5b280c + .quad 0x3fc1c7e77dde33dc + .quad 0x3fc195195c7d125b + .quad 0x3fc162593186da70 + .quad 0x3fc12fa6f550f896 + .quad 0x3fc0fd02a03727ea + .quad 0x3fc0ca6c2a9b6b41 + .quad 0x3fc097e38ce60649 + .quad 0x3fc06568bf8576b3 + .quad 0x3fc032fbbaee6d65 + .quad 0x3fc0009c779bc7b5 + .quad 0x3fbf9c95dc1d1165 + .quad 0x3fbf380e2d9ba4df + .quad 0x3fbed3a1d4cdbebb + .quad 0x3fbe6f50c2d9f754 + .quad 0x3fbe0b1ae8f2fd56 + .quad 0x3fbda700385788a2 + .quad 0x3fbd4300a2524d41 + .quad 0x3fbcdf1c1839ee74 + .quad 0x3fbc7b528b70f1c5 + .quad 0x3fbc17a3ed65b23c + .quad 0x3fbbb4102f925394 + .quad 0x3fbb5097437cb58e + .quad 0x3fbaed391ab6674e + .quad 0x3fba89f5a6dc9acc + .quad 0x3fba26ccd9981853 + .quad 0x3fb9c3bea49d3214 + .quad 0x3fb960caf9abb7ca + .quad 0x3fb8fdf1ca8eea6a + .quad 0x3fb89b33091d6fe8 + .quad 0x3fb8388ea739470a + .quad 0x3fb7d60496cfbb4c + .quad 0x3fb77394c9d958d5 + .quad 0x3fb7113f3259e07a + .quad 0x3fb6af03c2603bd0 + .quad 0x3fb64ce26c067157 + .quad 0x3fb5eadb217198a3 + .quad 0x3fb588edd4d1ceaa + .quad 0x3fb5271a78622a0f + .quad 0x3fb4c560fe68af88 + .quad 0x3fb463c15936464e + .quad 0x3fb4023b7b26ac9e + .quad 0x3fb3a0cf56a06c4b + .quad 0x3fb33f7cde14cf5a + .quad 0x3fb2de4403ffd4b3 + .quad 0x3fb27d24bae824db + .quad 0x3fb21c1ef55f06c2 + .quad 0x3fb1bb32a600549d + .quad 0x3fb15a5fbf7270ce + .quad 0x3fb0f9a634663add + .quad 0x3fb09905f797047c + .quad 0x3fb0387efbca869e + .quad 0x3fafb02267a1ad2d + .quad 0x3faeef792508b69d + .quad 0x3fae2f02159384fe + .quad 
0x3fad6ebd1f1febfe + .quad 0x3facaeaa27a02241 + .quad 0x3fabeec9151aac2e + .quad 0x3fab2f19cdaa46dc + .quad 0x3faa6f9c377dd31b + .quad 0x3fa9b05038d84095 + .quad 0x3fa8f135b8107912 + .quad 0x3fa8324c9b914bc7 + .quad 0x3fa77394c9d958d5 + .quad 0x3fa6b50e297afcce + .quad 0x3fa5f6b8a11c3c61 + .quad 0x3fa538941776b01e + .quad 0x3fa47aa07357704f + .quad 0x3fa3bcdd9b9f00f3 + .quad 0x3fa2ff4b77413dcb + .quad 0x3fa241e9ed454683 + .quad 0x3fa184b8e4c56af8 + .quad 0x3fa0c7b844ef1795 + .quad 0x3fa00ae7f502c1c4 + .quad 0x3f9e9c8fb8a7a900 + .quad 0x3f9d23afc49139f9 + .quad 0x3f9bab2fdcb46ec7 + .quad 0x3f9a330fd028f75f + .quad 0x3f98bb4f6e2bd536 + .quad 0x3f9743ee861f3556 + .quad 0x3f95ccece78a4a9e + .quad 0x3f94564a62192834 + .quad 0x3f92e006c59c9c29 + .quad 0x3f916a21e20a0a45 + .quad 0x3f8fe9370ef68e1b + .quad 0x3f8cfee70c5ce5dc + .quad 0x3f8a15535d0bab34 + .quad 0x3f872c7ba20f7327 + .quad 0x3f84445f7cbc8fd2 + .quad 0x3f815cfe8eaec830 + .quad 0x3f7cecb0f3922091 + .quad 0x3f7720d9c06a835f + .quad 0x3f715676c8c7a8c1 + .quad 0x3f671b0ea42e5fda + .quad 0x3f57182a894b69c6 + .quad 0x8000000000000000 + /*== poly_coeff[5] ==*/ + .align 16 + .quad 0x3fd2776E996DA1D2, 0x3fd2776E996DA1D2 /* coeff5 */ + .quad 0xbfd715494C3E7C9B, 0xbfd715494C3E7C9B /* coeff4 */ + .quad 0x3fdEC709DC39E926, 0x3fdEC709DC39E926 /* coeff3 */ + .quad 0xbfe71547652B7CF8, 0xbfe71547652B7CF8 /* coeff2 */ + .quad 0x3ff71547652B82FE, 0x3ff71547652B82FE /* coeff1 */ + /*== ExpMask ==*/ + .align 16 + .quad 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 16 + .quad 0x3f50000000000000, 0x3f50000000000000 + /*== MinNorm ==*/ + .align 16 + .quad 0x0010000000000000, 0x0010000000000000 + /*== MaxNorm ==*/ + .align 16 + .quad 0x7fefffffffffffff, 0x7fefffffffffffff + /*== HalfMask ==*/ + .align 16 + .quad 0xfffffffffc000000, 0xfffffffffc000000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== Threshold ==*/ + .align 16 + .quad 0x4086a00000000000, 0x4086a00000000000 + /*== Bias 
==*/ + .align 16 + .quad 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 16 + .quad 0x408ff00000000000, 0x408ff00000000000 + .align 16 + .type __svml_dlog2_data_internal,@object + .size __svml_dlog2_data_internal,.-__svml_dlog2_data_internal + .space 80, 0x00 + .align 16 + +.FLT_11: + .long 0x00000000,0x43380000,0x00000000,0x43380000 + .type .FLT_11,@object + .size .FLT_11,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core-sse.S new file mode 100644 index 0000000000..882ee276f2 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized log2, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN4v_log2 _ZGVdN4v_log2_sse_wrapper +#include "../svml_d_log24_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core.c new file mode 100644 index 0000000000..7678090d11 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log2, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN4v_log2 +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_log2, __GI__ZGVdN4v_log2, __redirect__ZGVdN4v_log2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core_avx2.S new file mode 100644 index 0000000000..e06f4481c6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log24_core_avx2.S @@ -0,0 +1,1321 @@ +/* Function log2 vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details.
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log2(x) = k - log2(Rcp) + poly_approximation(R) + * log2(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog2_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8224 +#define poly_coeff 12352 +#define ExpMask 12512 +#define Two10 12544 +#define MinNorm 12576 +#define MaxNorm 12608 +#define HalfMask 12640 +#define One 12672 +#define Threshold 12704 +#define Bias 12736 +#define Bias1 12768 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_log2_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea -4218848+__svml_dlog2_data_internal(%rip), %r8 + vmovapd %ymm0, %ymm3 + +/* preserve mantissa, set input exponent to 2^(-10) */ + vandpd ExpMask+__svml_dlog2_data_internal(%rip), %ymm3, %ymm4 + vorpd Two10+__svml_dlog2_data_internal(%rip), %ymm4, %ymm2 + +/* reciprocal approximation good to at least 11 bits */ + vcvtpd2ps %ymm2, %xmm5 + +/* exponent bits */ + vpsrlq $20, %ymm3, %ymm7 + vmovupd One+__svml_dlog2_data_internal(%rip), %ymm14 + vrcpps %xmm5, %xmm6 + +/* check range */ + vcmplt_oqpd MinNorm+__svml_dlog2_data_internal(%rip), %ymm3, %ymm11 + vcmpnle_uqpd MaxNorm+__svml_dlog2_data_internal(%rip), %ymm3, %ymm12 + vcvtps2pd %xmm6, %ymm9 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + vroundpd $0, %ymm9, %ymm1 + +/* exponent */ + vmovupd Threshold+__svml_dlog2_data_internal(%rip), %ymm9 + +/* + * prepare table index + * table lookup + */ + vpsrlq $40, %ymm1, %ymm15 + +/* argument reduction */ + vfmsub213pd %ymm14, %ymm1, %ymm2 + +/* polynomial */ + vmovupd poly_coeff+__svml_dlog2_data_internal(%rip), %ymm14 + vcmplt_oqpd
%ymm1, %ymm9, %ymm1 + vfmadd213pd poly_coeff+32+__svml_dlog2_data_internal(%rip), %ymm2, %ymm14 + vorpd %ymm12, %ymm11, %ymm13 + vmulpd %ymm2, %ymm2, %ymm12 + +/* combine and get argument value range mask */ + vmovmskpd %ymm13, %eax + vextractf128 $1, %ymm7, %xmm8 + vshufps $221, %xmm8, %xmm7, %xmm10 + +/* biased exponent in DP format */ + vcvtdq2pd %xmm10, %ymm0 + vandpd Bias+__svml_dlog2_data_internal(%rip), %ymm1, %ymm10 + vorpd Bias1+__svml_dlog2_data_internal(%rip), %ymm10, %ymm11 + vsubpd %ymm11, %ymm0, %ymm1 + vmovupd poly_coeff+64+__svml_dlog2_data_internal(%rip), %ymm0 + vfmadd213pd poly_coeff+96+__svml_dlog2_data_internal(%rip), %ymm2, %ymm0 + vmulpd poly_coeff+128+__svml_dlog2_data_internal(%rip), %ymm2, %ymm2 + vfmadd213pd %ymm0, %ymm12, %ymm14 + vfmadd213pd %ymm2, %ymm12, %ymm14 + vextractf128 $1, %ymm15, %xmm6 + vmovd %xmm15, %edx + vmovd %xmm6, %esi + movslq %edx, %rdx + vpextrd $2, %xmm15, %ecx + movslq %esi, %rsi + vpextrd $2, %xmm6, %edi + movslq %ecx, %rcx + movslq %edi, %rdi + vmovsd (%r8,%rdx), %xmm4 + vmovsd (%r8,%rsi), %xmm7 + vmovhpd (%r8,%rcx), %xmm4, %xmm5 + vmovhpd (%r8,%rdi), %xmm7, %xmm8 + vinsertf128 $1, %xmm8, %ymm5, %ymm13 + +/* reconstruction */ + vaddpd %ymm14, %ymm13, %ymm0 + vaddpd %ymm0, %ymm1, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm3, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 
0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to
process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call log2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_log2_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dlog2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(32)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(32)) VUINT32 poly_coeff[5][4][2]; + __declspec(align(32)) VUINT32 ExpMask[4][2]; + __declspec(align(32)) VUINT32 Two10[4][2]; + __declspec(align(32)) VUINT32 MinNorm[4][2]; + __declspec(align(32)) VUINT32 MaxNorm[4][2]; + __declspec(align(32)) VUINT32 HalfMask[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 Bias[4][2]; + __declspec(align(32)) VUINT32 Bias1[4][2]; +} __svml_dlog2_data_internal; +#endif +__svml_dlog2_data_internal: + /* Log_HA_table */ + .quad 0xc08ff00000000000, 0x0000000000000000 + .quad 0xc08ff0040038c920, 0x3d52bfc81744e999 + .quad 0xc08ff007ff0f0190, 0xbd59b2cedc63c895 + .quad 0xc08ff00bfc839e88, 0xbd28e365e6741d71 + .quad 0xc08ff00ff8979428, 0x3d4027998f69a77d + .quad 0xc08ff013f34bd5a0, 0x3d5dd2cb33fe6a89 + .quad 0xc08ff017eca15518, 0xbd526514cdf2c019 + .quad 0xc08ff01be49903d8, 0xbd44bfeeba165e04 + .quad 0xc08ff01fdb33d218, 0xbd3fa79ee110cec3 + .quad 0xc08ff023d072af20, 0xbd4eebb642c7fd60 + .quad 0xc08ff027c4568948, 0x3d429b13d7093443 + .quad 0xc08ff02bb6e04de8, 0x3d50f346bd36551e + .quad 0xc08ff02fa810e968, 0xbd5020bb662f1536 + .quad 0xc08ff03397e94750, 0x3d5de76b56340995 + .quad 0xc08ff037866a5218, 0x3d58065ff3304090 + .quad 0xc08ff03b7394f360, 0x3d561fc9322fb785 + .quad 0xc08ff03f5f6a13d0, 0x3d0abecd17d0d778 + .quad 0xc08ff04349ea9b28, 0xbd588f3ad0ce4d44 + .quad 0xc08ff04733177040, 
0xbd4454ba4ac5f44d + .quad 0xc08ff04b1af178f8, 0xbd556f78faaa0887 + .quad 0xc08ff04f01799a58, 0x3d49db8976de7469 + .quad 0xc08ff052e6b0b868, 0xbd5cdb6fce17ef00 + .quad 0xc08ff056ca97b668, 0xbd576de8c0412f09 + .quad 0xc08ff05aad2f76a0, 0x3d30142c7ec6475c + .quad 0xc08ff05e8e78da70, 0xbd1e685afc26de72 + .quad 0xc08ff0626e74c260, 0xbd40b64c954078a3 + .quad 0xc08ff0664d240e10, 0xbd5fcde393462d7d + .quad 0xc08ff06a2a879c48, 0xbd537245eeeecc53 + .quad 0xc08ff06e06a04ae8, 0x3d4ac306eb47b436 + .quad 0xc08ff071e16ef6e8, 0xbd5a1fd9d3758f6b + .quad 0xc08ff075baf47c80, 0x3d2401fbaaa67e3c + .quad 0xc08ff0799331b6f0, 0x3d4f8dbef47a4d53 + .quad 0xc08ff07d6a2780a8, 0x3d51215e0abb42d1 + .quad 0xc08ff0813fd6b340, 0x3d57ce6249eddb35 + .quad 0xc08ff08514402770, 0xbd38a803c7083a25 + .quad 0xc08ff088e764b528, 0x3d42218beba5073e + .quad 0xc08ff08cb9453370, 0x3d447b66f1c6248f + .quad 0xc08ff09089e27880, 0xbd53d9297847e995 + .quad 0xc08ff094593d59c8, 0xbd12b6979cc77aa9 + .quad 0xc08ff0982756abd0, 0xbd55308545ecd702 + .quad 0xc08ff09bf42f4260, 0xbd578fa97c3b936f + .quad 0xc08ff09fbfc7f068, 0xbd41828408ce869d + .quad 0xc08ff0a38a218808, 0x3d555da6ce7251a6 + .quad 0xc08ff0a7533cda88, 0xbd41f3cd14bfcb02 + .quad 0xc08ff0ab1b1ab878, 0xbd1f028da6bf1852 + .quad 0xc08ff0aee1bbf188, 0xbd4cf04de3267f54 + .quad 0xc08ff0b2a72154a8, 0xbd4556e47019db10 + .quad 0xc08ff0b66b4baff8, 0x3d1e7ba00b15fbe4 + .quad 0xc08ff0ba2e3bd0d0, 0x3d5bfde1c52c2f28 + .quad 0xc08ff0bdeff283b8, 0x3d48d63fe20ee5d6 + .quad 0xc08ff0c1b0709480, 0x3d57f551980838ff + .quad 0xc08ff0c56fb6ce20, 0xbd4189091f293c81 + .quad 0xc08ff0c92dc5fae0, 0x3d4d549f05f06169 + .quad 0xc08ff0ccea9ee428, 0xbd5982466074e1e3 + .quad 0xc08ff0d0a64252b8, 0xbd5d30a6b16c0e4b + .quad 0xc08ff0d460b10e80, 0xbd3138bf3b51a201 + .quad 0xc08ff0d819ebdea8, 0xbd454e680c0801d6 + .quad 0xc08ff0dbd1f389a8, 0x3d584db361385926 + .quad 0xc08ff0df88c8d520, 0xbd564f2252a82c03 + .quad 0xc08ff0e33e6c8610, 0xbd5c78c35ed5d034 + .quad 0xc08ff0e6f2df60a8, 0xbd52eb9f29ca3d75 + 
.quad 0xc08ff0eaa6222860, 0x3d5340c0c01b5ff8 + .quad 0xc08ff0ee58359fe8, 0x3d10c2acaffa64b6 + .quad 0xc08ff0f2091a8948, 0xbd3fced311301ebe + .quad 0xc08ff0f5b8d1a5c8, 0x3d41ee5d591af30b + .quad 0xc08ff0f9675bb5f0, 0x3d4873546b0e668c + .quad 0xc08ff0fd14b97998, 0x3d5a99928177a119 + .quad 0xc08ff100c0ebafd8, 0x3d378ead132adcac + .quad 0xc08ff1046bf31720, 0x3d51a538bc597d48 + .quad 0xc08ff10815d06d18, 0xbd540ee2f35efd7e + .quad 0xc08ff10bbe846ec8, 0xbd59cf94753adacc + .quad 0xc08ff10f660fd878, 0xbd5201a3d6862895 + .quad 0xc08ff1130c7365c0, 0x3d383e25d0822d03 + .quad 0xc08ff116b1afd180, 0xbd0b7389bbea8f7b + .quad 0xc08ff11a55c5d5f0, 0xbd4df278087a6617 + .quad 0xc08ff11df8b62c98, 0xbd48daeb8ec01e26 + .quad 0xc08ff1219a818e50, 0x3d57c9312e0a14da + .quad 0xc08ff1253b28b330, 0xbd5f0fbc0e4d507e + .quad 0xc08ff128daac52c8, 0xbd222afdee008687 + .quad 0xc08ff12c790d23d8, 0x3d17c71747bcef8b + .quad 0xc08ff130164bdc88, 0x3d5d69cfd051af50 + .quad 0xc08ff133b2693248, 0x3d59dff064e9433a + .quad 0xc08ff1374d65d9e8, 0x3d4f71a30db3240b + .quad 0xc08ff13ae7428788, 0xbd5e56afa9524606 + .quad 0xc08ff13e7fffeeb0, 0xbd44acd84e6f8518 + .quad 0xc08ff142179ec228, 0xbd519845ade5e121 + .quad 0xc08ff145ae1fb420, 0xbd5b3b4a38ddec70 + .quad 0xc08ff14943837620, 0xbd5ea4bb5bc137c7 + .quad 0xc08ff14cd7cab910, 0x3d5610f3bf8eb6ce + .quad 0xc08ff1506af62d20, 0x3d57b1170d6184cf + .quad 0xc08ff153fd0681f0, 0x3d5791a688a3660e + .quad 0xc08ff1578dfc6678, 0x3d5d41ecf8abac2e + .quad 0xc08ff15b1dd88908, 0x3cf0bd995d64d573 + .quad 0xc08ff15eac9b9758, 0xbd5e3653cd796d01 + .quad 0xc08ff1623a463e80, 0xbd597573005ef2d8 + .quad 0xc08ff165c6d92af0, 0xbd4ee222d6439c41 + .quad 0xc08ff16952550880, 0x3d5913b845e75950 + .quad 0xc08ff16cdcba8258, 0xbd558e7ba239077e + .quad 0xc08ff170660a4328, 0x3d5a0e174a2cae66 + .quad 0xc08ff173ee44f4d8, 0x3d22b8db103db712 + .quad 0xc08ff177756b40d8, 0x3d5cc610480853c4 + .quad 0xc08ff17afb7dcfe0, 0xbd304a8bc84e5c0f + .quad 0xc08ff17e807d4a28, 0x3d3639d185da5f7d + .quad 0xc08ff182046a5738, 
0xbd534705d06d788f + .quad 0xc08ff18587459e10, 0xbd540d25b28a51fd + .quad 0xc08ff189090fc510, 0xbd02d804afa7080a + .quad 0xc08ff18c89c97200, 0x3d5f2a5d305818ba + .quad 0xc08ff19009734a08, 0xbd3a602e9d05c3e4 + .quad 0xc08ff193880df1d0, 0xbd533d6fdcd54875 + .quad 0xc08ff197059a0d60, 0x3d24eaf0a9490202 + .quad 0xc08ff19a82184020, 0xbd5685666d98eb59 + .quad 0xc08ff19dfd892cf8, 0xbd509f8745f0868b + .quad 0xc08ff1a177ed7630, 0xbd2dcba340a9d268 + .quad 0xc08ff1a4f145bd80, 0x3d4916fcd0331266 + .quad 0xc08ff1a86992a408, 0xbd548cd033a49073 + .quad 0xc08ff1abe0d4ca68, 0xbd5252f40e5df1a2 + .quad 0xc08ff1af570cd0a0, 0xbd541d623bd02248 + .quad 0xc08ff1b2cc3b5628, 0xbd258dc48235c071 + .quad 0xc08ff1b64060f9e0, 0xbd4b4bd8f02ed3f2 + .quad 0xc08ff1b9b37e5a28, 0x3d4e8d20a88cd0a2 + .quad 0xc08ff1bd259414c0, 0x3d3b669b6380bc55 + .quad 0xc08ff1c096a2c6e8, 0xbd45d54159d51094 + .quad 0xc08ff1c406ab0d58, 0x3d59f684ffbca44d + .quad 0xc08ff1c775ad8428, 0x3d543b1b1d508399 + .quad 0xc08ff1cae3aac6f8, 0x3d5c30953a12fc6e + .quad 0xc08ff1ce50a370d0, 0xbd1763b04f9aad5f + .quad 0xc08ff1d1bc981c40, 0x3d573c6fa54f46c2 + .quad 0xc08ff1d527896338, 0x3d48ccfb9ffd7455 + .quad 0xc08ff1d89177df30, 0x3d42756f80d6f7ce + .quad 0xc08ff1dbfa642910, 0xbd3c2bfbc353c5a5 + .quad 0xc08ff1df624ed940, 0x3d1d6064f5dc380b + .quad 0xc08ff1e2c9388798, 0x3ce327c6b30711cf + .quad 0xc08ff1e62f21cb70, 0x3d140aa9546525bc + .quad 0xc08ff1e9940b3b98, 0xbd15c1ff43c21863 + .quad 0xc08ff1ecf7f56e60, 0x3d590ba680120498 + .quad 0xc08ff1f05ae0f988, 0x3d5390c6b62dff50 + .quad 0xc08ff1f3bcce7258, 0x3d4da0c90878457f + .quad 0xc08ff1f71dbe6d90, 0x3d30697edc85b98c + .quad 0xc08ff1fa7db17f70, 0x3d04d81188510a79 + .quad 0xc08ff1fddca83bb0, 0xbd5f2ddc983ce25c + .quad 0xc08ff2013aa33598, 0x3d46c22f0fae6844 + .quad 0xc08ff20497a2ffd0, 0xbd53359b714c3d03 + .quad 0xc08ff207f3a82ca0, 0xbd4aefaa5524f88b + .quad 0xc08ff20b4eb34dc0, 0x3d39bf4a4a73d01d + .quad 0xc08ff20ea8c4f468, 0x3d44217befdb12e6 + .quad 0xc08ff21201ddb158, 0x3d5219b281d4b6f8 + 
.quad 0xc08ff21559fe14c8, 0xbd5e3b123373d370 + .quad 0xc08ff218b126ae88, 0xbd59b525a6edc3cb + .quad 0xc08ff21c07580dd8, 0xbd4b494e7737c4dc + .quad 0xc08ff21f5c92c180, 0xbd3989b7d67e3e54 + .quad 0xc08ff222b0d757d0, 0x3d486c8f098ad3cf + .quad 0xc08ff22604265e98, 0x3d5254956d8e15b2 + .quad 0xc08ff22956806330, 0x3d3f14730a362959 + .quad 0xc08ff22ca7e5f278, 0xbd40e8ed02e32ea1 + .quad 0xc08ff22ff85798d8, 0xbd40fb2b9b1e0261 + .quad 0xc08ff23347d5e238, 0xbd5bfeb1e13c8bc3 + .quad 0xc08ff23696615a18, 0x3d5b891f041e037b + .quad 0xc08ff239e3fa8b60, 0xbd36255027582bb9 + .quad 0xc08ff23d30a200a8, 0x3d56bb5a92a55361 + .quad 0xc08ff2407c5843f0, 0xbd31902fb4417244 + .quad 0xc08ff243c71dded8, 0xbd5a8a7c3c4a2cc6 + .quad 0xc08ff24710f35a88, 0xbd23be1be6941016 + .quad 0xc08ff24a59d93fa8, 0x3d55c85afafa1d46 + .quad 0xc08ff24da1d01668, 0xbd5b4b05a0adcbf1 + .quad 0xc08ff250e8d866a0, 0x3d134d191476f74b + .quad 0xc08ff2542ef2b798, 0x3d5e78ce963395e1 + .quad 0xc08ff257741f9028, 0x3d3f9219a8f57c17 + .quad 0xc08ff25ab85f76c8, 0x3d5cfc6f47ac691b + .quad 0xc08ff25dfbb2f168, 0x3d4ab3b720b5ca71 + .quad 0xc08ff2613e1a8598, 0x3d54a4ab99feb71a + .quad 0xc08ff2647f96b868, 0xbd42daa69d79d724 + .quad 0xc08ff267c0280e88, 0xbd344d9115018f45 + .quad 0xc08ff26affcf0c28, 0xbd56673e143d2ac0 + .quad 0xc08ff26e3e8c3518, 0x3d3aac889e91c638 + .quad 0xc08ff2717c600ca8, 0x3d4cf65b41d006e7 + .quad 0xc08ff274b94b15c0, 0xbd4c821320391e76 + .quad 0xc08ff277f54dd2e8, 0x3d51abd6e2ddc2a1 + .quad 0xc08ff27b3068c620, 0xbd2f1bdd1264e703 + .quad 0xc08ff27e6a9c7110, 0xbd58437b4f032f15 + .quad 0xc08ff281a3e954f0, 0xbd4f8e063b069a7d + .quad 0xc08ff284dc4ff288, 0x3d5276d0723a662a + .quad 0xc08ff28813d0ca28, 0xbd5731f7c6d8f6eb + .quad 0xc08ff28b4a6c5bd0, 0xbd58b587f08307ec + .quad 0xc08ff28e80232708, 0x3d57f19a7a352baf + .quad 0xc08ff291b4f5aae0, 0x3d570d99aff32790 + .quad 0xc08ff294e8e46610, 0x3d4efafaad4f59db + .quad 0xc08ff2981befd6e0, 0xbd41eb1728371564 + .quad 0xc08ff29b4e187b38, 0x3d458465b4e080d7 + .quad 0xc08ff29e7f5ed088, 
0x3d46acb4a035a820 + .quad 0xc08ff2a1afc353e0, 0xbd39fc68238dd5d3 + .quad 0xc08ff2a4df4681f0, 0x3d526d90c6750dde + .quad 0xc08ff2a80de8d6f0, 0x3d48505c598278fd + .quad 0xc08ff2ab3baacec0, 0x3d520fece8e148e8 + .quad 0xc08ff2ae688ce4d0, 0x3d14f7bf38646243 + .quad 0xc08ff2b1948f9430, 0xbd5aa5f693a627df + .quad 0xc08ff2b4bfb35790, 0xbd4725d8e6280861 + .quad 0xc08ff2b7e9f8a930, 0x3d482e0765d44bda + .quad 0xc08ff2bb136002e8, 0xbd523d745da75cde + .quad 0xc08ff2be3be9de40, 0xbd32e50b4191ef73 + .quad 0xc08ff2c16396b448, 0xbd490856dfe073b2 + .quad 0xc08ff2c48a66fdb8, 0xbd512b526137db4d + .quad 0xc08ff2c7b05b32e8, 0x3d5bfcdc71b36585 + .quad 0xc08ff2cad573cbb8, 0xbd2c24f2afddb377 + .quad 0xc08ff2cdf9b13fc0, 0xbd5ea60d06da12f6 + .quad 0xc08ff2d11d140630, 0xbd582f2f9e256dc5 + .quad 0xc08ff2d43f9c95d0, 0xbd4411c269523864 + .quad 0xc08ff2d7614b6508, 0xbd41107eeb7e1093 + .quad 0xc08ff2da8220e9e8, 0x3d5a4aa491710eda + .quad 0xc08ff2dda21d9a10, 0x3d46e50a14550378 + .quad 0xc08ff2e0c141ead0, 0xbd4881e3bd846de9 + .quad 0xc08ff2e3df8e5118, 0xbd46d93437bd399d + .quad 0xc08ff2e6fd034170, 0xbd5b4ef1e9713a4c + .quad 0xc08ff2ea19a13010, 0x3d4a0e31ed25b3ef + .quad 0xc08ff2ed356890b8, 0xbd5a7a560db90113 + .quad 0xc08ff2f05059d6f0, 0x3d51f5bb5f9072c9 + .quad 0xc08ff2f36a7575c0, 0x3d5ed5225350a585 + .quad 0xc08ff2f683bbdfe0, 0xbd1c9363d9e745db + .quad 0xc08ff2f99c2d87b8, 0x3d329c788e376e0d + .quad 0xc08ff2fcb3cadf40, 0xbd59eb5d29918de0 + .quad 0xc08ff2ffca945828, 0xbd4a86aac097a06b + .quad 0xc08ff302e08a63b8, 0x3d541c2c97e8b4d1 + .quad 0xc08ff305f5ad72d8, 0x3d43c95dec31821b + .quad 0xc08ff30909fdf620, 0xbd590abed3d72738 + .quad 0xc08ff30c1d7c5dd8, 0x3d4caefdad90e913 + .quad 0xc08ff30f302919d0, 0xbd4f7ed5e1dcb170 + .quad 0xc08ff312420499a0, 0x3d3c590edf8c3407 + .quad 0xc08ff315530f4c70, 0x3d5477d46ce838e1 + .quad 0xc08ff3186349a118, 0x3d5e4b00c511fa78 + .quad 0xc08ff31b72b40610, 0xbd54333e5a0c1658 + .quad 0xc08ff31e814ee990, 0x3d25300b88bfa10a + .quad 0xc08ff3218f1ab958, 0xbd5bfbd520249ed7 + 
.quad 0xc08ff3249c17e2f0, 0x3d436b1cdba645b7 + .quad 0xc08ff327a846d368, 0xbd5cb667c2f86eaa + .quad 0xc08ff32ab3a7f7a0, 0x3d5334d06a920d5f + .quad 0xc08ff32dbe3bbbf8, 0xbd5407602ab64243 + .quad 0xc08ff330c8028ca0, 0xbd52b12c9cc82316 + .quad 0xc08ff333d0fcd560, 0x3d158d7dd801324b + .quad 0xc08ff336d92b01a8, 0xbd38b55deae69564 + .quad 0xc08ff339e08d7ca0, 0x3d4a92d51dc43d43 + .quad 0xc08ff33ce724b110, 0x3d5455afbb5de008 + .quad 0xc08ff33fecf10970, 0x3d3b65694b6f87fb + .quad 0xc08ff342f1f2efe8, 0xbd3afb8ccc1260eb + .quad 0xc08ff345f62ace50, 0x3d59c98f7ec71b79 + .quad 0xc08ff348f9990e18, 0xbd5238294ff3846d + .quad 0xc08ff34bfc3e1880, 0x3d4deba7087bbf7b + .quad 0xc08ff34efe1a5650, 0xbd573e25d2d308e5 + .quad 0xc08ff351ff2e3020, 0xbd44bc302ffa76fb + .quad 0xc08ff354ff7a0e20, 0xbd2cad65891df000 + .quad 0xc08ff357fefe5838, 0x3d4b4fe326c05a8a + .quad 0xc08ff35afdbb75f8, 0x3d0fb5680f67649b + .quad 0xc08ff35dfbb1cea8, 0xbd4af509a9977e57 + .quad 0xc08ff360f8e1c940, 0x3cea69221cfb0ad6 + .quad 0xc08ff363f54bcc60, 0x3d3d116c159fead5 + .quad 0xc08ff366f0f03e58, 0xbd5e64e8bff70d5e + .quad 0xc08ff369ebcf8538, 0xbd5cc32ce5effb96 + .quad 0xc08ff36ce5ea06b8, 0x3d57bbe811e4fbda + .quad 0xc08ff36fdf402830, 0xbcf46d4595033678 + .quad 0xc08ff372d7d24ec8, 0x3d4c4bbec857b9fc + .quad 0xc08ff375cfa0df40, 0xbd59d3f339613a2d + .quad 0xc08ff378c6ac3e28, 0x3d58408e1bcb4e24 + .quad 0xc08ff37bbcf4cfa0, 0x3d5fdb793dc8e643 + .quad 0xc08ff37eb27af788, 0xbd5f0d884b401f1e + .quad 0xc08ff381a73f1988, 0xbd5a7ed37e2c50b4 + .quad 0xc08ff3849b4198e8, 0x3d5b14c1f630b2af + .quad 0xc08ff3878e82d898, 0x3d505a9abef02aff + .quad 0xc08ff38a81033b50, 0xbd4a9bbd51a7d1c4 + .quad 0xc08ff38d72c32380, 0x3d4783623464f80e + .quad 0xc08ff39063c2f338, 0xbd0e2d78f68abcc7 + .quad 0xc08ff39354030c50, 0x3d3e604763e782cb + .quad 0xc08ff3964383d048, 0xbd4514f0840b6f59 + .quad 0xc08ff3993245a060, 0xbd5488753d6035a4 + .quad 0xc08ff39c2048dd90, 0x3d5ccc099b5ff97d + .quad 0xc08ff39f0d8de870, 0x3d454ada83325c69 + .quad 0xc08ff3a1fa152168, 
0x3d1e4b27fb754eb1 + .quad 0xc08ff3a4e5dee890, 0x3d58c67819ead583 + .quad 0xc08ff3a7d0eb9da8, 0xbd536d02e85d644b + .quad 0xc08ff3aabb3ba048, 0x3d5f510ab9e7c184 + .quad 0xc08ff3ada4cf4f98, 0x3d557bc5b296d5f5 + .quad 0xc08ff3b08da70a90, 0xbd48893b8f7f52c9 + .quad 0xc08ff3b375c32fe8, 0x3d5ca0b69a37d601 + .quad 0xc08ff3b65d241df0, 0xbd519c57fff86872 + .quad 0xc08ff3b943ca32d8, 0x3d048da0e3a8c3c3 + .quad 0xc08ff3bc29b5cc68, 0xbd5dd05e06ec07d0 + .quad 0xc08ff3bf0ee74840, 0x3d56c52a5c8015db + .quad 0xc08ff3c1f35f0398, 0x3d54e1dba9930bed + .quad 0xc08ff3c4d71d5b78, 0x3d2c5f679a7932b7 + .quad 0xc08ff3c7ba22aca0, 0xbd3f77628aa1aed8 + .quad 0xc08ff3cd7e03ac60, 0xbd5cc8a22f1d8591 + .quad 0xc08ff3d33f04e360, 0x3d4ae09463e13f6f + .quad 0xc08ff3d8fd292dc8, 0x3d42736efbec3922 + .quad 0xc08ff3deb8736390, 0xbce0324f8d149b09 + .quad 0xc08ff3e470e65870, 0xbd52089e4b8dd900 + .quad 0xc08ff3ea2684dbf0, 0xbd5f8e9d5dea127f + .quad 0xc08ff3efd951b970, 0xbd4b60d79db026b1 + .quad 0xc08ff3f5894fb828, 0x3d45ff1d6cea2c52 + .quad 0xc08ff3fb36819b38, 0x3d5d56022cd7f5b2 + .quad 0xc08ff400e0ea21a8, 0xbd58d63f09907b27 + .quad 0xc08ff406888c0690, 0xbd4ce6ea362f7ce0 + .quad 0xc08ff40c2d6a00f0, 0x3d519fc9ad2ef3ab + .quad 0xc08ff411cf86c3c8, 0xbd55fc89e7b55f20 + .quad 0xc08ff4176ee4fe40, 0xbd53229ca791d9be + .quad 0xc08ff41d0b875b88, 0x3d5e7733e6fb23d1 + .quad 0xc08ff422a57082e0, 0x3d5871413696b637 + .quad 0xc08ff4283ca317c0, 0x3d4b118aa7f493b9 + .quad 0xc08ff42dd121b9c8, 0x3d4bdf3692763b50 + .quad 0xc08ff43362ef04c8, 0x3d4867e17476dd63 + .quad 0xc08ff438f20d90c8, 0xbd5d49b741c778f3 + .quad 0xc08ff43e7e7ff228, 0x3d59ac35724f01e3 + .quad 0xc08ff4440848b968, 0xbd5251ccdc49432d + .quad 0xc08ff4498f6a7388, 0x3d56cf153ebc9f07 + .quad 0xc08ff44f13e7a9b8, 0x3d503b7a697a659c + .quad 0xc08ff45495c2e198, 0xbd5fa03da8acd872 + .quad 0xc08ff45a14fe9d38, 0xbd5e6cfb0b5c38fc + .quad 0xc08ff45f919d5b08, 0x3d468b1f1269f1cf + .quad 0xc08ff4650ba195e0, 0xbd313a3a8f72c0f3 + .quad 0xc08ff46a830dc528, 0x3d205d31eb8d2bd4 + 
.quad 0xc08ff46ff7e45cb8, 0xbd56cb8ddf5d4a90 + .quad 0xc08ff4756a27cd00, 0x3d272c2d46acdcbf + .quad 0xc08ff47ad9da82e8, 0xbd4946efab7a989d + .quad 0xc08ff48046fee800, 0xbd23fabe48cf933c + .quad 0xc08ff485b1976268, 0x3d4f03b099d80f79 + .quad 0xc08ff48b19a654e0, 0x3d4fe0c35ab7e9b5 + .quad 0xc08ff4907f2e1ed0, 0xbd54b4843f34fe09 + .quad 0xc08ff495e2311c58, 0xbd5dfa6541236a64 + .quad 0xc08ff49b42b1a648, 0x3d56fd2c8c418cbb + .quad 0xc08ff4a0a0b21218, 0x3d5e687ef208418a + .quad 0xc08ff4a5fc34b210, 0x3d4a671ce14c5521 + .quad 0xc08ff4ab553bd540, 0x3d419d0202e3cd96 + .quad 0xc08ff4b0abc9c780, 0x3d576b941a895781 + .quad 0xc08ff4b5ffe0d170, 0xbd4ea96d88cd1a30 + .quad 0xc08ff4bb518338a0, 0x3d4d6b405bd43ba6 + .quad 0xc08ff4c0a0b33f60, 0xbcf03382150a56b7 + .quad 0xc08ff4c5ed7324f8, 0xbd400df96beb0937 + .quad 0xc08ff4cb37c52590, 0xbd5c161714cdebd5 + .quad 0xc08ff4d07fab7a48, 0xbd333e8eda1a8e79 + .quad 0xc08ff4d5c5285928, 0x3d53aba20381d59f + .quad 0xc08ff4db083df530, 0xbd45e9b07af4e77c + .quad 0xc08ff4e048ee7e70, 0xbd533cfdb78a8c41 + .quad 0xc08ff4e5873c21f0, 0xbd5d9b87f4d283f2 + .quad 0xc08ff4eac32909c8, 0xbd53a677deee97fa + .quad 0xc08ff4effcb75d18, 0xbd5afd9f5dedc208 + .quad 0xc08ff4f533e94020, 0x3ce9dd794d20ab77 + .quad 0xc08ff4fa68c0d428, 0xbd5eeae84ba1cbf1 + .quad 0xc08ff4ff9b4037b0, 0xbd4f4451587282c8 + .quad 0xc08ff504cb698648, 0xbd4a1fa15087e717 + .quad 0xc08ff509f93ed8b0, 0xbd5f2f0042b9331a + .quad 0xc08ff50f24c244e0, 0xbd2c2389f8e86341 + .quad 0xc08ff5144df5ddf0, 0xbd556fcb7b48f200 + .quad 0xc08ff51974dbb448, 0x3d43ba060aa69038 + .quad 0xc08ff51e9975d578, 0x3d477ef38ca20229 + .quad 0xc08ff523bbc64c60, 0x3d49bcaf1aa4168a + .quad 0xc08ff528dbcf2120, 0xbd51c5609b60687e + .quad 0xc08ff52df9925930, 0xbd51691708d22ce7 + .quad 0xc08ff5331511f750, 0x3d30d05c98ecb3d1 + .quad 0xc08ff5382e4ffb90, 0xbd423adb056dd244 + .quad 0xc08ff53d454e6368, 0xbd3663607042da50 + .quad 0xc08ff5425a0f29a8, 0x3d42655d3c6187a6 + .quad 0xc08ff5476c944680, 0xbd028c958ae09d20 + .quad 0xc08ff54c7cdfaf90, 
0xbd436eaf17756653 + .quad 0xc08ff5518af357e8, 0x3d5fbbbee66f8d24 + .quad 0xc08ff55696d12ff0, 0xbd5d93b389497880 + .quad 0xc08ff55ba07b25b0, 0xbd43ff8ff777f337 + .quad 0xc08ff560a7f32488, 0xbcf3568803ec82a4 + .quad 0xc08ff565ad3b1560, 0xbd50c83eba5cc7ea + .quad 0xc08ff56ab054deb0, 0x3d5becc2411500b7 + .quad 0xc08ff56fb1426458, 0xbd5dac964ffa8b83 + .quad 0xc08ff574b00587f0, 0x3d1d82f6cc82e69f + .quad 0xc08ff579aca02878, 0xbd34767c0d40542c + .quad 0xc08ff57ea7142298, 0xbd52d28e996ed2ce + .quad 0xc08ff5839f635090, 0xbd432a85d337086d + .quad 0xc08ff588958f8a38, 0x3d512b06ec20c7fd + .quad 0xc08ff58d899aa500, 0xbd47e2147555e10b + .quad 0xc08ff5927b867410, 0xbd4d84480a1b301d + .quad 0xc08ff5976b54c830, 0x3d5622146f3a51bd + .quad 0xc08ff59c59076fc8, 0x3d46d485c5f9c392 + .quad 0xc08ff5a144a03700, 0xbd4562714549f4fd + .quad 0xc08ff5a62e20e7b8, 0x3d541ab67e365a63 + .quad 0xc08ff5ab158b4970, 0xbd5b0855668b2369 + .quad 0xc08ff5affae12188, 0x3d27de1bc2ed4dd8 + .quad 0xc08ff5b4de243300, 0x3d40f2592d5ed454 + .quad 0xc08ff5b9bf563ea8, 0xbd4ee2f8ba7b3e9e + .quad 0xc08ff5be9e790320, 0xbd3c2214335c2164 + .quad 0xc08ff5c37b8e3cc8, 0x3d30745623ab1fd9 + .quad 0xc08ff5c85697a5d0, 0xbd326c8fb0ffde38 + .quad 0xc08ff5cd2f96f640, 0xbd4c83277493b0bc + .quad 0xc08ff5d2068de3f8, 0x3d39bb1655e6e5ba + .quad 0xc08ff5d6db7e22a8, 0x3d403170b47a5559 + .quad 0xc08ff5dbae6963e8, 0x3d5801ddf1edc325 + .quad 0xc08ff5e07f515728, 0x3d4b2704c46fe064 + .quad 0xc08ff5e54e37a9c8, 0x3d5a16e99ed6cd83 + .quad 0xc08ff5ea1b1e0700, 0xbd5353a3ac18c62f + .quad 0xc08ff5eee6061810, 0x3d567c69c189f21a + .quad 0xc08ff5f3aef18400, 0xbd50dd3220e0b0f2 + .quad 0xc08ff5f875e1eff0, 0xbd3ab64d80638db2 + .quad 0xc08ff5fd3ad8fee0, 0x3d3ec753439035aa + .quad 0xc08ff601fdd851c8, 0xbd5e10415f5f5e74 + .quad 0xc08ff606bee187b0, 0xbd55f1048b113fae + .quad 0xc08ff60b7df63d90, 0x3d1e94e4107406c8 + .quad 0xc08ff6103b180e60, 0xbd4e2eb5d0c36eb5 + .quad 0xc08ff614f6489330, 0x3d43ec5c714f709a + .quad 0xc08ff619af896308, 0x3d519ec459b62a08 + 
.quad 0xc08ff61e66dc1300, 0xbd5b93d09dd6161d + .quad 0xc08ff6231c423658, 0x3d5d72b849dd56be + .quad 0xc08ff627cfbd5e38, 0xbd276b7e32659173 + .quad 0xc08ff62c814f1a08, 0x3d4fd918f2e7a6b9 + .quad 0xc08ff63130f8f730, 0x3d5609ba1dcc4c97 + .quad 0xc08ff635debc8138, 0xbd55cab233dbd84c + .quad 0xc08ff63a8a9b41d8, 0xbd56778ab7aaabc9 + .quad 0xc08ff63f3496c0e0, 0x3d5b2791da49c370 + .quad 0xc08ff643dcb08438, 0x3d583063ef145f9c + .quad 0xc08ff64882ea1000, 0xbd484e9cab375fb6 + .quad 0xc08ff64d2744e688, 0xbd5c430c95c374aa + .quad 0xc08ff651c9c28848, 0xbd57a16d78490bb3 + .quad 0xc08ff6566a6473e8, 0xbd445d70374ea9ec + .quad 0xc08ff65b092c2648, 0x3d5c9729142b9d4b + .quad 0xc08ff65fa61b1a70, 0xbd4aaa179d032405 + .quad 0xc08ff6644132c9c0, 0xbd2a3ea300d173de + .quad 0xc08ff668da74abc0, 0x3d57809438efb010 + .quad 0xc08ff66d71e23630, 0xbd5e9156720951d6 + .quad 0xc08ff672077cdd30, 0xbd5bab62e8462035 + .quad 0xc08ff6769b461310, 0xbd05113545431443 + .quad 0xc08ff67b2d3f4868, 0x3d5105eb0607e59b + .quad 0xc08ff67fbd69ec18, 0xbd5e657842b37dc0 + .quad 0xc08ff6844bc76b68, 0x3d4ad1849705bc4c + .quad 0xc08ff688d85931c8, 0xbd508b6f92b6e0d6 + .quad 0xc08ff68d6320a920, 0x3d48683cceb5fdfc + .quad 0xc08ff691ec1f3990, 0xbd2c25ee290acbf5 + .quad 0xc08ff696735649a8, 0x3d58904932cd46d0 + .quad 0xc08ff69af8c73e38, 0xbd5c964167f0bfeb + .quad 0xc08ff69f7c737a90, 0xbd43d66937fa06a9 + .quad 0xc08ff6a3fe5c6040, 0xbd54bc302ffa76fb + .quad 0xc08ff6a87e834f50, 0x3d4609b1487f87a3 + .quad 0xc08ff6acfce9a618, 0xbd42c0d9af0400b1 + .quad 0xc08ff6b17990c170, 0x3d549a63973d262d + .quad 0xc08ff6b5f479fc80, 0xbd28cde894aa0641 + .quad 0xc08ff6ba6da6b0f0, 0xbd5acef617609a34 + .quad 0xc08ff6bee51836d8, 0x3d4abb9ff3cf80b8 + .quad 0xc08ff6c35acfe4a8, 0xbd53dcfa1b7697f3 + .quad 0xc08ff6c7cecf0f68, 0x3d5bcdf4aea18a55 + .quad 0xc08ff6cc41170a70, 0x3d3cad29d4324038 + .quad 0xc08ff6d0b1a927b0, 0x3d56945f9cc2a565 + .quad 0xc08ff6d52086b780, 0x3d5d20dfc1c668a7 + .quad 0xc08ff6d98db108b8, 0x3d37f20a9bcbbe04 + .quad 0xc08ff6ddf92968b8, 
0x3d1e0824a6e3a4d2 + .quad 0xc08ff6e262f12358, 0xbd469f07bf6322c7 + .quad 0xc08ff6e6cb0982f8, 0xbd5cc593afdbfaef + .quad 0xc08ff6eb3173d080, 0xbd5ee68d555d7122 + .quad 0xc08ff6ef96315360, 0xbd144ee1d6a39124 + .quad 0xc08ff6f3f9435188, 0xbd40f2cb308bcd25 + .quad 0xc08ff6f85aab0f80, 0xbd5fd98ced08a73c + .quad 0xc08ff6fcba69d068, 0x3d54f2f2a1ea8606 + .quad 0xc08ff7011880d5d0, 0xbd57818234572db7 + .quad 0xc08ff70574f16008, 0x3d52429e823a9a83 + .quad 0xc08ff709cfbcadd0, 0x3d5d6dc9bb81476c + .quad 0xc08ff70e28e3fc90, 0x3d57d189e116bcb2 + .quad 0xc08ff71280688848, 0x3d0e18992809fd6d + .quad 0xc08ff716d64b8b98, 0xbd3b48ac92b8549a + .quad 0xc08ff71b2a8e3fb8, 0xbd4dcfa48040893b + .quad 0xc08ff71f7d31dc88, 0x3d58d945b8e53ef1 + .quad 0xc08ff723ce379878, 0x3d4f80faef3e15ee + .quad 0xc08ff7281da0a8b0, 0x3d53edc0fd40d18f + .quad 0xc08ff72c6b6e40f0, 0xbd4bcac66e0be72f + .quad 0xc08ff730b7a193b0, 0xbd44fcf96e2ec967 + .quad 0xc08ff735023bd208, 0x3d57e2ff34b08d86 + .quad 0xc08ff7394b3e2bb0, 0xbd4caedfb10b98dd + .quad 0xc08ff73d92a9cf28, 0xbd55db1083e5ac6a + .quad 0xc08ff741d87fe990, 0xbd580e83e6d54ed6 + .quad 0xc08ff7461cc1a6c0, 0x3d1688c83e1b0cba + .quad 0xc08ff74a5f703138, 0xbd52c398c872b701 + .quad 0xc08ff74ea08cb240, 0xbd49aabc3683b259 + .quad 0xc08ff752e01851d0, 0x3d5ccba8de72495b + .quad 0xc08ff7571e143688, 0xbd5981cf630f5793 + .quad 0xc08ff75b5a8185e8, 0xbd4f235844e01ebd + .quad 0xc08ff75f95616410, 0xbd5047de7ba8ec62 + .quad 0xc08ff763ceb4f3f0, 0x3d5fa55e004d6562 + .quad 0xc08ff768067d5720, 0xbd49f386e521a80e + .quad 0xc08ff76c3cbbae20, 0x3d3693551e62fe83 + .quad 0xc08ff77071711818, 0x3d4ba63b30b6c42c + .quad 0xc08ff774a49eb300, 0x3d4c26523d32f573 + .quad 0xc08ff778d6459b98, 0x3d3b65e70806143a + .quad 0xc08ff77d0666ed68, 0xbd5796d9c9f2c2cb + .quad 0xc08ff7813503c2d0, 0x3d33267b004b912b + .quad 0xc08ff785621d34e8, 0x3d1d5d8a23e33341 + .quad 0xc08ff7898db45ba8, 0x3d46c95233e60f40 + .quad 0xc08ff78db7ca4dd0, 0x3d362865acc8f43f + .quad 0xc08ff791e06020f8, 0xbd10e8203e161511 + 
.quad 0xc08ff7960776e988, 0xbd5cafe4f4467eaa + .quad 0xc08ff79a2d0fbac8, 0xbd520fddea9ea0cd + .quad 0xc08ff79e512ba6d0, 0x3d5c53d3778dae46 + .quad 0xc08ff7a273cbbe80, 0xbd5f0f6f88490367 + .quad 0xc08ff7a694f111c0, 0x3d5601aa3f55ec11 + .quad 0xc08ff7aab49caf20, 0xbd4f1a8a2328a4c4 + .quad 0xc08ff7aed2cfa438, 0xbd4a3d5341c07d0e + .quad 0xc08ff7b2ef8afd68, 0xbd5f4a1f4c525f31 + .quad 0xc08ff7b70acfc600, 0xbd4d594d77b3d775 + .quad 0xc08ff7bb249f0828, 0x3d2aef47e37e953b + .quad 0xc08ff7bf3cf9ccf0, 0x3d501803b47dfba2 + .quad 0xc08ff7c353e11c50, 0x3d5ed5ec84e5745e + .quad 0xc08ff7c76955fd20, 0xbd3de249bc9e7f96 + .quad 0xc08ff7cb7d597538, 0x3d5b5794341d1fdf + .quad 0xc08ff7cf8fec8938, 0xbd519dbd08276359 + .quad 0xc08ff7d3a1103cd0, 0xbd450129b8038848 + .quad 0xc08ff7d7b0c59288, 0x3d348f00d3bb30fd + .quad 0xc08ff7dbbf0d8bd8, 0xbd43529025720d8a + .quad 0xc08ff7dfcbe92938, 0x3d5abdaa2b1955d7 + .quad 0xc08ff7e3d75969f8, 0xbd4e8837d4588a98 + .quad 0xc08ff7e7e15f4c80, 0x3d57a782a6df5a1f + .quad 0xc08ff7ebe9fbce08, 0x3d304ba3eaa96bf1 + .quad 0xc08ff7eff12fead8, 0xbd47aab17b868a60 + .quad 0xc08ff7f3f6fc9e28, 0xbd5bd858693ba90a + .quad 0xc08ff7f7fb62e230, 0x3d26abb2c547789a + .quad 0xc08ff7fbfe63b010, 0xbd59d383d543b3f5 + .quad 0xc08ff80000000000, 0x8000000000000000 + /*== Log_LA_table ==*/ + .align 32 + .quad 0x0000000000000000 + .quad 0xbf670f83ff0a7565 + .quad 0xbf7709c46d7aac77 + .quad 0xbf8143068125dd0e + .quad 0xbf86fe50b6ef0851 + .quad 0xbf8cb6c3abd14559 + .quad 0xbf91363117a97b0c + .quad 0xbf940f9786685d29 + .quad 0xbf96e79685c2d22a + .quad 0xbf99be2f7749acc2 + .quad 0xbf9c9363ba850f86 + .quad 0xbf9f6734acf8695a + .quad 0xbfa11cd1d5133413 + .quad 0xbfa2855905ca70f6 + .quad 0xbfa3ed3094685a26 + .quad 0xbfa554592bb8cd58 + .quad 0xbfa6bad3758efd87 + .quad 0xbfa820a01ac754cb + .quad 0xbfa985bfc3495194 + .quad 0xbfaaea3316095f72 + .quad 0xbfac4dfab90aab5f + .quad 0xbfadb1175160f3b0 + .quad 0xbfaf1389833253a0 + .quad 0xbfb03aa8f8dc854c + .quad 0xbfb0eb389fa29f9b + .quad 
0xbfb19b74069f5f0a + .quad 0xbfb24b5b7e135a3d + .quad 0xbfb2faef55ccb372 + .quad 0xbfb3aa2fdd27f1c3 + .quad 0xbfb4591d6310d85a + .quad 0xbfb507b836033bb7 + .quad 0xbfb5b600a40bd4f3 + .quad 0xbfb663f6fac91316 + .quad 0xbfb7119b876bea86 + .quad 0xbfb7beee96b8a281 + .quad 0xbfb86bf07507a0c7 + .quad 0xbfb918a16e46335b + .quad 0xbfb9c501cdf75872 + .quad 0xbfba7111df348494 + .quad 0xbfbb1cd1ecae66e7 + .quad 0xbfbbc84240adabba + .quad 0xbfbc73632513bd4f + .quad 0xbfbd1e34e35b82da + .quad 0xbfbdc8b7c49a1ddb + .quad 0xbfbe72ec117fa5b2 + .quad 0xbfbf1cd21257e18c + .quad 0xbfbfc66a0f0b00a5 + .quad 0xbfc037da278f2870 + .quad 0xbfc08c588cda79e4 + .quad 0xbfc0e0b05ac848ed + .quad 0xbfc134e1b489062e + .quad 0xbfc188ecbd1d16be + .quad 0xbfc1dcd197552b7b + .quad 0xbfc2309065d29791 + .quad 0xbfc284294b07a640 + .quad 0xbfc2d79c6937efdd + .quad 0xbfc32ae9e278ae1a + .quad 0xbfc37e11d8b10f89 + .quad 0xbfc3d1146d9a8a64 + .quad 0xbfc423f1c2c12ea2 + .quad 0xbfc476a9f983f74d + .quad 0xbfc4c93d33151b24 + .quad 0xbfc51bab907a5c8a + .quad 0xbfc56df5328d58c5 + .quad 0xbfc5c01a39fbd688 + .quad 0xbfc6121ac74813cf + .quad 0xbfc663f6fac91316 + .quad 0xbfc6b5aef4aae7dc + .quad 0xbfc70742d4ef027f + .quad 0xbfc758b2bb6c7b76 + .quad 0xbfc7a9fec7d05ddf + .quad 0xbfc7fb27199df16d + .quad 0xbfc84c2bd02f03b3 + .quad 0xbfc89d0d0ab430cd + .quad 0xbfc8edcae8352b6c + .quad 0xbfc93e6587910444 + .quad 0xbfc98edd077e70df + .quad 0xbfc9df31868c11d5 + .quad 0xbfca2f632320b86b + .quad 0xbfca7f71fb7bab9d + .quad 0xbfcacf5e2db4ec94 + .quad 0xbfcb1f27d7bd7a80 + .quad 0xbfcb6ecf175f95e9 + .quad 0xbfcbbe540a3f036f + .quad 0xbfcc0db6cdd94dee + .quad 0xbfcc5cf77f860826 + .quad 0xbfccac163c770dc9 + .quad 0xbfccfb1321b8c400 + .quad 0xbfcd49ee4c325970 + .quad 0xbfcd98a7d8a605a7 + .quad 0xbfcde73fe3b1480f + .quad 0xbfce35b689cd2655 + .quad 0xbfce840be74e6a4d + .quad 0xbfced2401865df52 + .quad 0xbfcf205339208f27 + .quad 0xbfcf6e456567fe55 + .quad 0xbfcfbc16b902680a + .quad 0xbfd004e3a7c97cbd + .quad 0xbfd02baba24d0664 + .quad 
0xbfd0526359bab1b3 + .quad 0xbfd0790adbb03009 + .quad 0xbfd09fa235ba2020 + .quad 0xbfd0c62975542a8f + .quad 0xbfd0eca0a7e91e0b + .quad 0xbfd11307dad30b76 + .quad 0xbfd1395f1b5b61a6 + .quad 0xbfd15fa676bb08ff + .quad 0xbfd185ddfa1a7ed0 + .quad 0xbfd1ac05b291f070 + .quad 0xbfd1d21dad295632 + .quad 0xbfd1f825f6d88e13 + .quad 0xbfd21e1e9c877639 + .quad 0xbfd24407ab0e073a + .quad 0xbfd269e12f346e2c + .quad 0xbfd28fab35b32683 + .quad 0xbfd2b565cb3313b6 + .quad 0xbfd2db10fc4d9aaf + .quad 0xbfd300acd58cbb10 + .quad 0xbfd32639636b2836 + .quad 0xbfd34bb6b2546218 + .quad 0xbfd37124cea4cded + .quad 0xbfd39683c4a9ce9a + .quad 0xbfd3bbd3a0a1dcfb + .quad 0xbfd3e1146ebc9ff2 + .quad 0xbfd406463b1b0449 + .quad 0xbfd42b6911cf5465 + .quad 0xbfd4507cfedd4fc4 + .quad 0xbfd475820e3a4251 + .quad 0xbfd49a784bcd1b8b + .quad 0xbfd4bf5fc36e8577 + .quad 0xbfd4e43880e8fb6a + .quad 0xbfd509028ff8e0a2 + .quad 0xbfd52dbdfc4c96b3 + .quad 0xbfd5526ad18493ce + .quad 0xbfd577091b3378cb + .quad 0xbfd59b98e4de271c + .quad 0xbfd5c01a39fbd688 + .quad 0xbfd5e48d25f62ab9 + .quad 0xbfd608f1b42948ae + .quad 0xbfd62d47efe3ebee + .quad 0xbfd6518fe4677ba7 + .quad 0xbfd675c99ce81f92 + .quad 0xbfd699f5248cd4b8 + .quad 0xbfd6be12866f820d + .quad 0xbfd6e221cd9d0cde + .quad 0xbfd7062305156d1d + .quad 0xbfd72a1637cbc183 + .quad 0xbfd74dfb70a66388 + .quad 0xbfd771d2ba7efb3c + .quad 0xbfd7959c202292f1 + .quad 0xbfd7b957ac51aac4 + .quad 0xbfd7dd0569c04bff + .quad 0xbfd800a563161c54 + .quad 0xbfd82437a2ee70f7 + .quad 0xbfd847bc33d8618e + .quad 0xbfd86b332056db01 + .quad 0xbfd88e9c72e0b226 + .quad 0xbfd8b1f835e0b642 + .quad 0xbfd8d54673b5c372 + .quad 0xbfd8f88736b2d4e8 + .quad 0xbfd91bba891f1709 + .quad 0xbfd93ee07535f967 + .quad 0xbfd961f90527409c + .quad 0xbfd98504431717fc + .quad 0xbfd9a802391e232f + .quad 0xbfd9caf2f1498fa4 + .quad 0xbfd9edd6759b25e0 + .quad 0xbfda10acd0095ab4 + .quad 0xbfda33760a7f6051 + .quad 0xbfda56322edd3731 + .quad 0xbfda78e146f7bef4 + .quad 0xbfda9b835c98c70a + .quad 0xbfdabe18797f1f49 + .quad 
0xbfdae0a0a75ea862 + .quad 0xbfdb031befe06434 + .quad 0xbfdb258a5ca28608 + .quad 0xbfdb47ebf73882a1 + .quad 0xbfdb6a40c92b203f + .quad 0xbfdb8c88dbf8867a + .quad 0xbfdbaec439144dfd + .quad 0xbfdbd0f2e9e79031 + .quad 0xbfdbf314f7d0f6ba + .quad 0xbfdc152a6c24cae6 + .quad 0xbfdc3733502d04f8 + .quad 0xbfdc592fad295b56 + .quad 0xbfdc7b1f8c4f51a4 + .quad 0xbfdc9d02f6ca47b4 + .quad 0xbfdcbed9f5bb886a + .quad 0xbfdce0a4923a587d + .quad 0xbfdd0262d554051c + .quad 0xbfdd2414c80bf27d + .quad 0xbfdd45ba735baa4f + .quad 0xbfdd6753e032ea0f + .quad 0xbfdd88e11777b149 + .quad 0xbfddaa6222064fb9 + .quad 0xbfddcbd708b17359 + .quad 0xbfdded3fd442364c + .quad 0xbfde0e9c8d782cbd + .quad 0xbfde2fed3d097298 + .quad 0xbfde5131eba2b931 + .quad 0xbfde726aa1e754d2 + .quad 0xbfde939768714a32 + .quad 0xbfdeb4b847d15bce + .quad 0xbfded5cd488f1732 + .quad 0xbfdef6d67328e220 + .quad 0xbfdf17d3d01407af + .quad 0xbfdf38c567bcc541 + .quad 0xbfdf59ab4286576c + .quad 0xbfdf7a8568cb06cf + .quad 0xbfdf9b53e2dc34c4 + .quad 0xbfdfbc16b902680a + .quad 0xbfdfdccdf37d594c + .quad 0xbfdffd799a83ff9b + .quad 0x3fdfe1e649bb6335 + .quad 0x3fdfc151b11b3640 + .quad 0x3fdfa0c8937e7d5d + .quad 0x3fdf804ae8d0cd02 + .quad 0x3fdf5fd8a9063e35 + .quad 0x3fdf3f71cc1b629c + .quad 0x3fdf1f164a15389a + .quad 0x3fdefec61b011f85 + .quad 0x3fdede8136f4cbf1 + .quad 0x3fdebe47960e3c08 + .quad 0x3fde9e193073ac06 + .quad 0x3fde7df5fe538ab3 + .quad 0x3fde5dddf7e46e0a + .quad 0x3fde3dd1156507de + .quad 0x3fde1dcf4f1c1a9e + .quad 0x3fddfdd89d586e2b + .quad 0x3fddddecf870c4c1 + .quad 0x3fddbe0c58c3cff2 + .quad 0x3fdd9e36b6b825b1 + .quad 0x3fdd7e6c0abc3579 + .quad 0x3fdd5eac4d463d7e + .quad 0x3fdd3ef776d43ff4 + .quad 0x3fdd1f4d7febf868 + .quad 0x3fdcffae611ad12b + .quad 0x3fdce01a12f5d8d1 + .quad 0x3fdcc0908e19b7bd + .quad 0x3fdca111cb2aa5c5 + .quad 0x3fdc819dc2d45fe4 + .quad 0x3fdc62346dca1dfe + .quad 0x3fdc42d5c4c688b4 + .quad 0x3fdc2381c08baf4f + .quad 0x3fdc043859e2fdb3 + .quad 0x3fdbe4f9899d326e + .quad 0x3fdbc5c5489254cc + .quad 
0x3fdba69b8fa1ab02 + .quad 0x3fdb877c57b1b070 + .quad 0x3fdb686799b00be3 + .quad 0x3fdb495d4e9185f7 + .quad 0x3fdb2a5d6f51ff83 + .quad 0x3fdb0b67f4f46810 + .quad 0x3fdaec7cd882b46c + .quad 0x3fdacd9c130dd53f + .quad 0x3fdaaec59dadadbe + .quad 0x3fda8ff971810a5e + .quad 0x3fda713787ad97a5 + .quad 0x3fda527fd95fd8ff + .quad 0x3fda33d25fcb1fac + .quad 0x3fda152f142981b4 + .quad 0x3fd9f695efbbd0ef + .quad 0x3fd9d806ebc9921c + .quad 0x3fd9b98201a0f405 + .quad 0x3fd99b072a96c6b2 + .quad 0x3fd97c96600672ad + .quad 0x3fd95e2f9b51f04e + .quad 0x3fd93fd2d5e1bf1d + .quad 0x3fd921800924dd3b + .quad 0x3fd903372e90bee4 + .quad 0x3fd8e4f83fa145ee + .quad 0x3fd8c6c335d8b966 + .quad 0x3fd8a8980abfbd32 + .quad 0x3fd88a76b7e549c6 + .quad 0x3fd86c5f36dea3dc + .quad 0x3fd84e5181475449 + .quad 0x3fd8304d90c11fd3 + .quad 0x3fd812535ef3ff19 + .quad 0x3fd7f462e58e1688 + .quad 0x3fd7d67c1e43ae5c + .quad 0x3fd7b89f02cf2aad + .quad 0x3fd79acb8cf10390 + .quad 0x3fd77d01b66fbd37 + .quad 0x3fd75f417917e02c + .quad 0x3fd7418acebbf18f + .quad 0x3fd723ddb1346b65 + .quad 0x3fd7063a1a5fb4f2 + .quad 0x3fd6e8a004221b1f + .quad 0x3fd6cb0f6865c8ea + .quad 0x3fd6ad88411abfea + .quad 0x3fd6900a8836d0d5 + .quad 0x3fd6729637b59418 + .quad 0x3fd6552b49986277 + .quad 0x3fd637c9b7e64dc2 + .quad 0x3fd61a717cac1983 + .quad 0x3fd5fd2291fc33cf + .quad 0x3fd5dfdcf1eeae0e + .quad 0x3fd5c2a096a135dc + .quad 0x3fd5a56d7a370ded + .quad 0x3fd5884396d90702 + .quad 0x3fd56b22e6b578e5 + .quad 0x3fd54e0b64003b70 + .quad 0x3fd530fd08f29fa7 + .quad 0x3fd513f7cfcb68ce + .quad 0x3fd4f6fbb2cec598 + .quad 0x3fd4da08ac46495a + .quad 0x3fd4bd1eb680e548 + .quad 0x3fd4a03dcbd2e1be + .quad 0x3fd48365e695d797 + .quad 0x3fd466970128a987 + .quad 0x3fd449d115ef7d87 + .quad 0x3fd42d141f53b646 + .quad 0x3fd4106017c3eca3 + .quad 0x3fd3f3b4f9b3e939 + .quad 0x3fd3d712bf9c9def + .quad 0x3fd3ba7963fc1f8f + .quad 0x3fd39de8e1559f6f + .quad 0x3fd3816132316520 + .quad 0x3fd364e2511cc821 + .quad 0x3fd3486c38aa29a8 + .quad 0x3fd32bfee370ee68 + .quad 
0x3fd30f9a4c0d786d + .quad 0x3fd2f33e6d2120f2 + .quad 0x3fd2d6eb4152324f + .quad 0x3fd2baa0c34be1ec + .quad 0x3fd29e5eedbe4a35 + .quad 0x3fd28225bb5e64a4 + .quad 0x3fd265f526e603cb + .quad 0x3fd249cd2b13cd6c + .quad 0x3fd22dadc2ab3497 + .quad 0x3fd21196e87473d1 + .quad 0x3fd1f588973c8747 + .quad 0x3fd1d982c9d52708 + .quad 0x3fd1bd857b14c146 + .quad 0x3fd1a190a5d674a0 + .quad 0x3fd185a444fa0a7b + .quad 0x3fd169c05363f158 + .quad 0x3fd14de4cbfd373e + .quad 0x3fd13211a9b38424 + .quad 0x3fd11646e7791469 + .quad 0x3fd0fa848044b351 + .quad 0x3fd0deca6f11b58b + .quad 0x3fd0c318aedff3c0 + .quad 0x3fd0a76f3ab3c52c + .quad 0x3fd08bce0d95fa38 + .quad 0x3fd070352293d724 + .quad 0x3fd054a474bf0eb7 + .quad 0x3fd0391bff2dbcf3 + .quad 0x3fd01d9bbcfa61d4 + .quad 0x3fd00223a943dc19 + .quad 0x3fcfcd677e5ac81d + .quad 0x3fcf9697f3bd0ccf + .quad 0x3fcf5fd8a9063e35 + .quad 0x3fcf29299496a889 + .quad 0x3fcef28aacd72231 + .quad 0x3fcebbfbe83901a6 + .quad 0x3fce857d3d361368 + .quad 0x3fce4f0ea2509008 + .quad 0x3fce18b00e13123d + .quad 0x3fcde26177108d03 + .quad 0x3fcdac22d3e441d3 + .quad 0x3fcd75f41b31b6dd + .quad 0x3fcd3fd543a4ad5c + .quad 0x3fcd09c643f117f0 + .quad 0x3fccd3c712d31109 + .quad 0x3fcc9dd7a70ed160 + .quad 0x3fcc67f7f770a67e + .quad 0x3fcc3227facce950 + .quad 0x3fcbfc67a7fff4cc + .quad 0x3fcbc6b6f5ee1c9b + .quad 0x3fcb9115db83a3dd + .quad 0x3fcb5b844fb4b3ef + .quad 0x3fcb2602497d5346 + .quad 0x3fcaf08fbfe15c51 + .quad 0x3fcabb2ca9ec7472 + .quad 0x3fca85d8feb202f7 + .quad 0x3fca5094b54d2828 + .quad 0x3fca1b5fc4e0b465 + .quad 0x3fc9e63a24971f46 + .quad 0x3fc9b123cba27ed3 + .quad 0x3fc97c1cb13c7ec1 + .quad 0x3fc94724cca657be + .quad 0x3fc9123c1528c6ce + .quad 0x3fc8dd62821404a9 + .quad 0x3fc8a8980abfbd32 + .quad 0x3fc873dca68b06f4 + .quad 0x3fc83f304cdc5aa7 + .quad 0x3fc80a92f5218acc + .quad 0x3fc7d60496cfbb4c + .quad 0x3fc7a18529635926 + .quad 0x3fc76d14a4601225 + .quad 0x3fc738b2ff50ccad + .quad 0x3fc7046031c79f85 + .quad 0x3fc6d01c335dc9b5 + .quad 0x3fc69be6fbb3aa6f + .quad 
0x3fc667c08270b905 + .quad 0x3fc633a8bf437ce1 + .quad 0x3fc5ff9fa9e18595 + .quad 0x3fc5cba53a0762ed + .quad 0x3fc597b967789d12 + .quad 0x3fc563dc29ffacb2 + .quad 0x3fc5300d796df33a + .quad 0x3fc4fc4d4d9bb313 + .quad 0x3fc4c89b9e6807f5 + .quad 0x3fc494f863b8df35 + .quad 0x3fc46163957af02e + .quad 0x3fc42ddd2ba1b4a9 + .quad 0x3fc3fa651e276158 + .quad 0x3fc3c6fb650cde51 + .quad 0x3fc3939ff859bf9f + .quad 0x3fc36052d01c3dd7 + .quad 0x3fc32d13e4692eb7 + .quad 0x3fc2f9e32d5bfdd1 + .quad 0x3fc2c6c0a316a540 + .quad 0x3fc293ac3dc1a668 + .quad 0x3fc260a5f58c02bd + .quad 0x3fc22dadc2ab3497 + .quad 0x3fc1fac39d5b280c + .quad 0x3fc1c7e77dde33dc + .quad 0x3fc195195c7d125b + .quad 0x3fc162593186da70 + .quad 0x3fc12fa6f550f896 + .quad 0x3fc0fd02a03727ea + .quad 0x3fc0ca6c2a9b6b41 + .quad 0x3fc097e38ce60649 + .quad 0x3fc06568bf8576b3 + .quad 0x3fc032fbbaee6d65 + .quad 0x3fc0009c779bc7b5 + .quad 0x3fbf9c95dc1d1165 + .quad 0x3fbf380e2d9ba4df + .quad 0x3fbed3a1d4cdbebb + .quad 0x3fbe6f50c2d9f754 + .quad 0x3fbe0b1ae8f2fd56 + .quad 0x3fbda700385788a2 + .quad 0x3fbd4300a2524d41 + .quad 0x3fbcdf1c1839ee74 + .quad 0x3fbc7b528b70f1c5 + .quad 0x3fbc17a3ed65b23c + .quad 0x3fbbb4102f925394 + .quad 0x3fbb5097437cb58e + .quad 0x3fbaed391ab6674e + .quad 0x3fba89f5a6dc9acc + .quad 0x3fba26ccd9981853 + .quad 0x3fb9c3bea49d3214 + .quad 0x3fb960caf9abb7ca + .quad 0x3fb8fdf1ca8eea6a + .quad 0x3fb89b33091d6fe8 + .quad 0x3fb8388ea739470a + .quad 0x3fb7d60496cfbb4c + .quad 0x3fb77394c9d958d5 + .quad 0x3fb7113f3259e07a + .quad 0x3fb6af03c2603bd0 + .quad 0x3fb64ce26c067157 + .quad 0x3fb5eadb217198a3 + .quad 0x3fb588edd4d1ceaa + .quad 0x3fb5271a78622a0f + .quad 0x3fb4c560fe68af88 + .quad 0x3fb463c15936464e + .quad 0x3fb4023b7b26ac9e + .quad 0x3fb3a0cf56a06c4b + .quad 0x3fb33f7cde14cf5a + .quad 0x3fb2de4403ffd4b3 + .quad 0x3fb27d24bae824db + .quad 0x3fb21c1ef55f06c2 + .quad 0x3fb1bb32a600549d + .quad 0x3fb15a5fbf7270ce + .quad 0x3fb0f9a634663add + .quad 0x3fb09905f797047c + .quad 0x3fb0387efbca869e + .quad 
0x3fafb02267a1ad2d + .quad 0x3faeef792508b69d + .quad 0x3fae2f02159384fe + .quad 0x3fad6ebd1f1febfe + .quad 0x3facaeaa27a02241 + .quad 0x3fabeec9151aac2e + .quad 0x3fab2f19cdaa46dc + .quad 0x3faa6f9c377dd31b + .quad 0x3fa9b05038d84095 + .quad 0x3fa8f135b8107912 + .quad 0x3fa8324c9b914bc7 + .quad 0x3fa77394c9d958d5 + .quad 0x3fa6b50e297afcce + .quad 0x3fa5f6b8a11c3c61 + .quad 0x3fa538941776b01e + .quad 0x3fa47aa07357704f + .quad 0x3fa3bcdd9b9f00f3 + .quad 0x3fa2ff4b77413dcb + .quad 0x3fa241e9ed454683 + .quad 0x3fa184b8e4c56af8 + .quad 0x3fa0c7b844ef1795 + .quad 0x3fa00ae7f502c1c4 + .quad 0x3f9e9c8fb8a7a900 + .quad 0x3f9d23afc49139f9 + .quad 0x3f9bab2fdcb46ec7 + .quad 0x3f9a330fd028f75f + .quad 0x3f98bb4f6e2bd536 + .quad 0x3f9743ee861f3556 + .quad 0x3f95ccece78a4a9e + .quad 0x3f94564a62192834 + .quad 0x3f92e006c59c9c29 + .quad 0x3f916a21e20a0a45 + .quad 0x3f8fe9370ef68e1b + .quad 0x3f8cfee70c5ce5dc + .quad 0x3f8a15535d0bab34 + .quad 0x3f872c7ba20f7327 + .quad 0x3f84445f7cbc8fd2 + .quad 0x3f815cfe8eaec830 + .quad 0x3f7cecb0f3922091 + .quad 0x3f7720d9c06a835f + .quad 0x3f715676c8c7a8c1 + .quad 0x3f671b0ea42e5fda + .quad 0x3f57182a894b69c6 + .quad 0x8000000000000000 + /*== poly_coeff[5] ==*/ + .align 32 + .quad 0x3fd2776E996DA1D2, 0x3fd2776E996DA1D2, 0x3fd2776E996DA1D2, 0x3fd2776E996DA1D2 /* coeff5 */ + .quad 0xbfd715494C3E7C9B, 0xbfd715494C3E7C9B, 0xbfd715494C3E7C9B, 0xbfd715494C3E7C9B /* coeff4 */ + .quad 0x3fdEC709DC39E926, 0x3fdEC709DC39E926, 0x3fdEC709DC39E926, 0x3fdEC709DC39E926 /* coeff3 */ + .quad 0xbfe71547652B7CF8, 0xbfe71547652B7CF8, 0xbfe71547652B7CF8, 0xbfe71547652B7CF8 /* coeff2 */ + .quad 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE, 0x3ff71547652B82FE /* coeff1 */ + /*== ExpMask ==*/ + .align 32 + .quad 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 32 + .quad 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000 + /*== MinNorm ==*/ + .align 32 + .quad 
0x0010000000000000, 0x0010000000000000, 0x0010000000000000, 0x0010000000000000 + /*== MaxNorm ==*/ + .align 32 + .quad 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff + /*== HalfMask ==*/ + .align 32 + .quad 0xfffffffffc000000, 0xfffffffffc000000, 0xfffffffffc000000, 0xfffffffffc000000 + /*== One ==*/ + .align 32 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== Threshold ==*/ + .align 32 + .quad 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 32 + .quad 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 32 + .quad 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000 + .align 32 + .type __svml_dlog2_data_internal,@object + .size __svml_dlog2_data_internal,.-__svml_dlog2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core-avx2.S new file mode 100644 index 0000000000..804de5fe0c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized log2, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define _ZGVeN8v_log2 _ZGVeN8v_log2_avx2_wrapper +#include "../svml_d_log28_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core.c new file mode 100644 index 0000000000..bd55abecc7 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log2, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVeN8v_log2 +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_log2, __GI__ZGVeN8v_log2, __redirect__ZGVeN8v_log2) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core_avx512.S new file mode 100644 index 0000000000..211a78f315 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log28_core_avx512.S @@ -0,0 +1,293 @@ +/* Function log2 vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log2(x) = k - log2(Rcp) + poly_approximation(R) + * log2(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog2_data_internal_avx512 + */ +#define Log_tbl 0 +#define One 128 +#define C075 192 +#define poly_coeff9 256 +#define poly_coeff8 320 +#define poly_coeff7 384 +#define poly_coeff6 448 +#define poly_coeff5 512 +#define poly_coeff4 576 +#define poly_coeff3 640 +#define poly_coeff2 704 +#define poly_coeff1 768 + +#include + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_log2_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovaps %zmm0, %zmm7 + vgetmantpd $8, {sae}, %zmm7, %zmm6 + vmovups One+__svml_dlog2_data_internal_avx512(%rip), %zmm2 + vmovups poly_coeff5+__svml_dlog2_data_internal_avx512(%rip), %zmm12 + vmovups poly_coeff3+__svml_dlog2_data_internal_avx512(%rip), %zmm13 + +/* Start polynomial evaluation */ + vmovups poly_coeff9+__svml_dlog2_data_internal_avx512(%rip), %zmm10 + vmovups poly_coeff8+__svml_dlog2_data_internal_avx512(%rip), %zmm0 + vmovups poly_coeff7+__svml_dlog2_data_internal_avx512(%rip), %zmm11 + vmovups 
poly_coeff6+__svml_dlog2_data_internal_avx512(%rip), %zmm14 + +/* Prepare exponent correction: DblRcp<0.75? */ + vmovups C075+__svml_dlog2_data_internal_avx512(%rip), %zmm1 + +/* Table lookup */ + vmovups __svml_dlog2_data_internal_avx512(%rip), %zmm4 + +/* GetExp(x) */ + vgetexppd {sae}, %zmm7, %zmm5 + +/* DblRcp ~ 1/Mantissa */ + vrcp14pd %zmm6, %zmm8 + +/* x<=0? */ + vfpclasspd $94, %zmm7, %k0 + +/* round DblRcp to 4 fractional bits (RN mode, no Precision exception) */ + vrndscalepd $88, {sae}, %zmm8, %zmm3 + vmovups poly_coeff4+__svml_dlog2_data_internal_avx512(%rip), %zmm8 + kmovw %k0, %edx + +/* Reduced argument: R = DblRcp*Mantissa - 1 */ + vfmsub213pd {rn-sae}, %zmm2, %zmm3, %zmm6 + vcmppd $17, {sae}, %zmm1, %zmm3, %k1 + vfmadd231pd {rn-sae}, %zmm6, %zmm12, %zmm8 + vmovups poly_coeff2+__svml_dlog2_data_internal_avx512(%rip), %zmm12 + vfmadd231pd {rn-sae}, %zmm6, %zmm10, %zmm0 + vfmadd231pd {rn-sae}, %zmm6, %zmm11, %zmm14 + vmovups poly_coeff1+__svml_dlog2_data_internal_avx512(%rip), %zmm1 + +/* R^2 */ + vmulpd {rn-sae}, %zmm6, %zmm6, %zmm15 + vfmadd231pd {rn-sae}, %zmm6, %zmm13, %zmm12 + +/* Prepare table index */ + vpsrlq $48, %zmm3, %zmm9 + +/* add 1 to Expon if DblRcp<0.75 */ + vaddpd {rn-sae}, %zmm2, %zmm5, %zmm5{%k1} + vmulpd {rn-sae}, %zmm15, %zmm15, %zmm13 + vfmadd213pd {rn-sae}, %zmm14, %zmm15, %zmm0 + vfmadd213pd {rn-sae}, %zmm12, %zmm15, %zmm8 + vpermt2pd Log_tbl+64+__svml_dlog2_data_internal_avx512(%rip), %zmm9, %zmm4 + +/* polynomial */ + vfmadd213pd {rn-sae}, %zmm8, %zmm13, %zmm0 + vfmadd213pd {rn-sae}, %zmm1, %zmm6, %zmm0 + vfmadd213pd {rn-sae}, %zmm4, %zmm0, %zmm6 + vaddpd {rn-sae}, %zmm6, %zmm5, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm7 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process 
+ * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm7, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 
0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call log2@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_log2_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dlog2_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl[16][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 C075[8][2]; + __declspec(align(64)) VUINT32 poly_coeff9[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + } __svml_dlog2_data_internal_avx512; +#endif +__svml_dlog2_data_internal_avx512: + /*== Log_tbl ==*/ + .quad 0x0000000000000000 + .quad 0xbfb663f6fac91316 + .quad 0xbfc5c01a39fbd688 + .quad 0xbfcfbc16b902680a + .quad 0xbfd49a784bcd1b8b + .quad 0xbfd91bba891f1709 + .quad 0xbfdd6753e032ea0f + .quad 0xbfe0c10500d63aa6 + .quad 0x3fda8ff971810a5e + .quad 0x3fd6cb0f6865c8ea + .quad 0x3fd32bfee370ee68 + .quad 0x3fcf5fd8a9063e35 + .quad 0x3fc8a8980abfbd32 + .quad 0x3fc22dadc2ab3497 + .quad 0x3fb7d60496cfbb4c + .quad 0x3fa77394c9d958d5 + /*== One ==*/ + .align 64
+ .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== C075 0.75 ==*/ + .align 64 + .quad 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000 + /*== poly_coeff9 ==*/ + .align 64 + .quad 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12, 0x3fc4904bda0e1d12 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce, 0xbfc71fb84deb5cce + /*== poly_coeff7 ==*/ + .align 64 + .quad 0x3fca617351818613, 0x3fca617351818613, 0x3fca617351818613, 0x3fca617351818613, 0x3fca617351818613, 0x3fca617351818613, 0x3fca617351818613, 0x3fca617351818613 + /*== poly_coeff6 ==*/ + .align 64 + .quad 0xbfcec707e4e3144c, 0xbfcec707e4e3144c, 0xbfcec707e4e3144c, 0xbfcec707e4e3144c, 0xbfcec707e4e3144c, 0xbfcec707e4e3144c, 0xbfcec707e4e3144c, 0xbfcec707e4e3144c + /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3fd2776c5114d91a, 0x3fd2776c5114d91a, 0x3fd2776c5114d91a, 0x3fd2776c5114d91a, 0x3fd2776c5114d91a, 0x3fd2776c5114d91a, 0x3fd2776c5114d91a, 0x3fd2776c5114d91a + /*== poly_coeff4 ==*/ + .align 64 + .quad 0xbfd71547653d0f8d, 0xbfd71547653d0f8d, 0xbfd71547653d0f8d, 0xbfd71547653d0f8d, 0xbfd71547653d0f8d, 0xbfd71547653d0f8d, 0xbfd71547653d0f8d, 0xbfd71547653d0f8d + /*== poly_coeff3 ==*/ + .align 64 + .quad 0x3fdec709dc3a029f, 0x3fdec709dc3a029f, 0x3fdec709dc3a029f, 0x3fdec709dc3a029f, 0x3fdec709dc3a029f, 0x3fdec709dc3a029f, 0x3fdec709dc3a029f, 0x3fdec709dc3a029f + /*== poly_coeff2 ==*/ + .align 64 + .quad 0xbfe71547652b82d4, 0xbfe71547652b82d4, 0xbfe71547652b82d4, 0xbfe71547652b82d4, 0xbfe71547652b82d4, 0xbfe71547652b82d4, 0xbfe71547652b82d4, 0xbfe71547652b82d4 + /*== 
poly_coeff1 ==*/ + .align 64 + .quad 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe, 0x3ff71547652b82fe + .align 64 + .type __svml_dlog2_data_internal_avx512,@object + .size __svml_dlog2_data_internal_avx512,.-__svml_dlog2_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core-avx2.S new file mode 100644 index 0000000000..234bf4750b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized log2f. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN16v_log2f _ZGVeN16v_log2f_avx2_wrapper +#include "../svml_s_log2f16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core.c new file mode 100644 index 0000000000..abf4f04988 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log2f, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_log2f +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_log2f, __GI__ZGVeN16v_log2f, + __redirect__ZGVeN16v_log2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core_avx512.S new file mode 100644 index 0000000000..c3a5aceef4 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f16_core_avx512.S @@ -0,0 +1,231 @@ +/* Function log2f vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log2(x) = k - log2(Rcp) + poly_approximation(R) + * log2(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog2_data_internal_avx512 + */ +#define One 0 +#define coeff4 64 +#define coeff3 128 +#define coeff2 192 +#define coeff1 256 + +#include + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_log2f_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vgetmantps $11, {sae}, %zmm0, %zmm3 + vmovups __svml_slog2_data_internal_avx512(%rip), %zmm1 + vgetexpps {sae}, %zmm0, %zmm5 + +/* x<=0? */ + vfpclassps $94, %zmm0, %k0 + vsubps {rn-sae}, %zmm1, %zmm3, %zmm9 + vpsrld $19, %zmm3, %zmm7 + vgetexpps {sae}, %zmm3, %zmm6 + vpermps coeff4+__svml_slog2_data_internal_avx512(%rip), %zmm7, %zmm1 + vpermps coeff3+__svml_slog2_data_internal_avx512(%rip), %zmm7, %zmm2 + vpermps coeff2+__svml_slog2_data_internal_avx512(%rip), %zmm7, %zmm4 + vpermps coeff1+__svml_slog2_data_internal_avx512(%rip), %zmm7, %zmm8 + vsubps {rn-sae}, %zmm6, %zmm5, %zmm10 + vfmadd213ps {rn-sae}, %zmm2, %zmm9, %zmm1 + kmovw %k0, %edx + vfmadd213ps {rn-sae}, %zmm4, %zmm9, %zmm1 + vfmadd213ps {rn-sae}, %zmm8, %zmm9, %zmm1 + vfmadd213ps {rn-sae}, %zmm10, %zmm9, %zmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %zmm1, %zmm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm0, 64(%rsp) + vmovups 
%zmm1, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* 
DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call log2f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_log2f_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_slog2_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 coeff4[16][1]; + __declspec(align(64)) VUINT32 coeff3[16][1]; + __declspec(align(64)) VUINT32 coeff2[16][1]; + __declspec(align(64)) VUINT32 coeff1[16][1]; + } __svml_slog2_data_internal_avx512; +#endif +__svml_slog2_data_internal_avx512: + /*== One ==*/ + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + // c4 + .align 64 + .long 0xbea77e4a, 0xbe8aae3d + .long 0xbe67fe32, 0xbe43d1b6 + .long 0xbe26a589, 0xbe0ee09b + .long 0xbdf6a8a1, 0xbdd63b49 + .long 0xbf584e51, 0xbf3e80a1 + .long 0xbf2892f0, 0xbf15d377 + .long 0xbf05b525, 0xbeef8e30 + .long 0xbed75c8f, 0xbec24184 + // c3 + .align 64 + .long 0x3ef5910c, 0x3ef045a1 + .long 0x3ee7d87e, 0x3eddbb84 + .long 0x3ed2d6df, 0x3ec7bbd2 + .long 0x3ebcc42f, 0x3eb22616 + .long 0x3e8f3399, 0x3eb1223e + .long 0x3ec9db4a, 0x3edb7a09 + .long 0x3ee79a1a, 0x3eef77cb + .long 0x3ef407a4, 0x3ef607b4 + // c2 + .align 64 + .long 0xbf38a934, 0xbf387de6 + .long 0xbf37f6f0, 0xbf37048b + .long 0xbf35a88a, 0xbf33ed04 + .long 0xbf31df56, 0xbf2f8d82 + .long 0xbf416814, 0xbf3daf58 + 
.long 0xbf3b5c08, 0xbf39fa2a + .long 0xbf393713, 0xbf38d7e1 + .long 0xbf38b2cd, 0xbf38aa62 + // c1 + .align 64 + .long 0x3fb8aa3b, 0x3fb8a9c0 + .long 0x3fb8a6e8, 0x3fb89f4e + .long 0x3fb890cb, 0x3fb879b1 + .long 0x3fb858d8, 0x3fb82d90 + .long 0x3fb8655e, 0x3fb8883a + .long 0x3fb89aea, 0x3fb8a42f + .long 0x3fb8a848, 0x3fb8a9c9 + .long 0x3fb8aa2f, 0x3fb8aa3b + .align 64 + .type __svml_slog2_data_internal_avx512,@object + .size __svml_slog2_data_internal_avx512,.-__svml_slog2_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core-sse2.S new file mode 100644 index 0000000000..dd0e763ac9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized log2f, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_log2f _ZGVbN4v_log2f_sse2 +#include "../svml_s_log2f4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core.c new file mode 100644 index 0000000000..1eb68d9f52 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log2f, vector length is 4. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_log2f +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_log2f, __GI__ZGVbN4v_log2f, + __redirect__ZGVbN4v_log2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core_sse4.S new file mode 100644 index 0000000000..a45ea919f4 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f4_core_sse4.S @@ -0,0 +1,223 @@ +/* Function log2f vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log2(x) = k - log2(Rcp) + poly_approximation(R) + * log2(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog2_data_internal + */ +#define MinNorm 0 +#define MaxNorm 16 +#define iBrkValue 32 +#define iOffExpoMask 48 +#define One 64 +#define sPoly 80 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_log2f_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm1 + +/* reduction: compute r,n */ + movdqu iBrkValue+__svml_slog2_data_internal(%rip), %xmm2 + movaps %xmm0, %xmm4 + movdqu iOffExpoMask+__svml_slog2_data_internal(%rip), %xmm10 + psubd %xmm2, %xmm1 + pand %xmm1, %xmm10 + movaps %xmm0, %xmm3 + paddd %xmm2, %xmm10 + psrad $23, %xmm1 + movups sPoly+__svml_slog2_data_internal(%rip), %xmm5 + movups sPoly+32+__svml_slog2_data_internal(%rip), %xmm6 + movups sPoly+64+__svml_slog2_data_internal(%rip), %xmm7 + movups sPoly+96+__svml_slog2_data_internal(%rip), %xmm9 + cmpltps MinNorm+__svml_slog2_data_internal(%rip), %xmm4 + cmpnleps MaxNorm+__svml_slog2_data_internal(%rip), %xmm3 + cvtdq2ps %xmm1, %xmm1 + subps One+__svml_slog2_data_internal(%rip), %xmm10 + mulps %xmm10, %xmm5 + movaps %xmm10, %xmm8 + mulps %xmm10, %xmm6 + mulps %xmm10, %xmm8 + addps sPoly+16+__svml_slog2_data_internal(%rip), %xmm5 + mulps %xmm10, %xmm7 + addps sPoly+48+__svml_slog2_data_internal(%rip), %xmm6 + mulps %xmm10, %xmm9 + mulps %xmm8, %xmm5 + addps sPoly+80+__svml_slog2_data_internal(%rip), %xmm7 + addps sPoly+112+__svml_slog2_data_internal(%rip), %xmm9 + addps %xmm5, %xmm6 + mulps %xmm8, %xmm6 + orps %xmm3, %xmm4 + +/* combine and get argument value range mask */ + movmskps %xmm4, %edx + addps %xmm6, %xmm7 + mulps %xmm7, %xmm8 + addps %xmm8, %xmm9 + mulps %xmm10, 
%xmm9 + addps sPoly+128+__svml_slog2_data_internal(%rip), %xmm9 + mulps %xmm9, %xmm10 + addps %xmm10, %xmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm1, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm1 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call log2f@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_log2f_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_slog2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 MinNorm[4][1]; + __declspec(align(16)) VUINT32 
MaxNorm[4][1]; + __declspec(align(16)) VUINT32 iBrkValue[4][1]; + __declspec(align(16)) VUINT32 iOffExpoMask[4][1]; + __declspec(align(16)) VUINT32 One[4][1]; + __declspec(align(16)) VUINT32 sPoly[9][4][1]; +} __svml_slog2_data_internal; +#endif +__svml_slog2_data_internal: + /*== MinNorm ==*/ + .long 0x00800000, 0x00800000, 0x00800000, 0x00800000 + /*== MaxNorm ==*/ + .align 16 + .long 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff + /*== iBrkValue = SP 2/3 ==*/ + .align 16 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sOne = SP 1.0 ==*/ + .align 16 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== spoly[9] ==*/ + .align 16 + .long 0x3e554012, 0x3e554012, 0x3e554012, 0x3e554012 /* coeff9 */ + .long 0xbe638E14, 0xbe638E14, 0xbe638E14, 0xbe638E14 /* coeff8 */ + .long 0x3e4D660B, 0x3e4D660B, 0x3e4D660B, 0x3e4D660B /* coeff7 */ + .long 0xbe727824, 0xbe727824, 0xbe727824, 0xbe727824 /* coeff6 */ + .long 0x3e93DD07, 0x3e93DD07, 0x3e93DD07, 0x3e93DD07 /* coeff5 */ + .long 0xbeB8B969, 0xbeB8B969, 0xbeB8B969, 0xbeB8B969 /* coeff4 */ + .long 0x3eF637C0, 0x3eF637C0, 0x3eF637C0, 0x3eF637C0 /* coeff3 */ + .long 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B /* coeff2 */ + .long 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B /* coeff1 */ + .align 16 + .type __svml_slog2_data_internal,@object + .size __svml_slog2_data_internal,.-__svml_slog2_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core-sse.S new file mode 100644 index 0000000000..ec4b70568d --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized log2f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_log2f _ZGVdN8v_log2f_sse_wrapper +#include "../svml_s_log2f8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core.c new file mode 100644 index 0000000000..b3e958021a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log2f, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN8v_log2f +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_log2f, __GI__ZGVdN8v_log2f, + __redirect__ZGVdN8v_log2f) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core_avx2.S new file mode 100644 index 0000000000..bc0cb5081a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log2f8_core_avx2.S @@ -0,0 +1,226 @@ +/* Function log2f vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Get short reciprocal approximation Rcp ~ 1/mantissa(x) + * R = Rcp*x - 1.0 + * log2(x) = k - log2(Rcp) + poly_approximation(R) + * log2(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog2_data_internal + */ +#define MinNorm 0 +#define MaxNorm 32 +#define iBrkValue 64 +#define iOffExpoMask 96 +#define One 128 +#define sPoly 160 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_log2f_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* reduction: compute r,n */ + vmovups iBrkValue+__svml_slog2_data_internal(%rip), %ymm4 + vmovups sPoly+64+__svml_slog2_data_internal(%rip), %ymm9 + vmovups sPoly+128+__svml_slog2_data_internal(%rip), %ymm10 + vmovups sPoly+192+__svml_slog2_data_internal(%rip), %ymm12 + vpsubd %ymm4, %ymm0, %ymm1 + vcmplt_oqps MinNorm+__svml_slog2_data_internal(%rip), %ymm0, %ymm5 + vcmpnle_uqps MaxNorm+__svml_slog2_data_internal(%rip), %ymm0, %ymm6 + vpand iOffExpoMask+__svml_slog2_data_internal(%rip), %ymm1, %ymm3 + vpsrad $23, %ymm1, %ymm2 + vmovups sPoly+__svml_slog2_data_internal(%rip), %ymm1 + vpaddd %ymm4, %ymm3, %ymm8 + vcvtdq2ps %ymm2, %ymm14 + vsubps One+__svml_slog2_data_internal(%rip), %ymm8, %ymm13 + vfmadd213ps sPoly+32+__svml_slog2_data_internal(%rip), %ymm13, %ymm1 + vfmadd213ps sPoly+96+__svml_slog2_data_internal(%rip), %ymm13, %ymm9 + vmulps %ymm13, %ymm13, %ymm11 + vfmadd213ps sPoly+160+__svml_slog2_data_internal(%rip), %ymm13, %ymm10 + vfmadd213ps sPoly+224+__svml_slog2_data_internal(%rip), %ymm13, %ymm12 + vfmadd213ps %ymm9, %ymm11, %ymm1 + vfmadd213ps %ymm10, %ymm11, %ymm1 + vfmadd213ps %ymm12, %ymm11, %ymm1 + vfmadd213ps sPoly+256+__svml_slog2_data_internal(%rip), %ymm13, %ymm1 + vorps %ymm6, %ymm5, %ymm7 + +/* combine and get argument value range mask */ + vmovmskps %ymm7, %edx + vfmadd213ps %ymm14, %ymm13, %ymm1 + testl %edx, %edx + +/* Go to special 
inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + vmovaps %ymm1, %ymm0 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm0, 32(%rsp) + vmovups %ymm1, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm1 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm1 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: 
-32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call log2f@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_log2f_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_slog2_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 MinNorm[8][1]; + __declspec(align(32)) VUINT32 MaxNorm[8][1]; + __declspec(align(32)) VUINT32 iBrkValue[8][1]; + __declspec(align(32)) VUINT32 iOffExpoMask[8][1]; + __declspec(align(32)) VUINT32 One[8][1]; + __declspec(align(32)) VUINT32 sPoly[9][8][1]; +} __svml_slog2_data_internal; +#endif +__svml_slog2_data_internal: + /*== MinNorm ==*/ + .long 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000, 0x00800000 + /*== MaxNorm ==*/ + .align 32 + .long 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff + /*== iBrkValue = SP 2/3 ==*/ + .align 32 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 32 + .long 0x007fffff, 
0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sOne = SP 1.0 ==*/ + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== spoly[9] ==*/ + .align 32 + .long 0x3e554012, 0x3e554012, 0x3e554012, 0x3e554012, 0x3e554012, 0x3e554012, 0x3e554012, 0x3e554012 /* coeff9 */ + .long 0xbe638E14, 0xbe638E14, 0xbe638E14, 0xbe638E14, 0xbe638E14, 0xbe638E14, 0xbe638E14, 0xbe638E14 /* coeff8 */ + .long 0x3e4D660B, 0x3e4D660B, 0x3e4D660B, 0x3e4D660B, 0x3e4D660B, 0x3e4D660B, 0x3e4D660B, 0x3e4D660B /* coeff7 */ + .long 0xbe727824, 0xbe727824, 0xbe727824, 0xbe727824, 0xbe727824, 0xbe727824, 0xbe727824, 0xbe727824 /* coeff6 */ + .long 0x3e93DD07, 0x3e93DD07, 0x3e93DD07, 0x3e93DD07, 0x3e93DD07, 0x3e93DD07, 0x3e93DD07, 0x3e93DD07 /* coeff5 */ + .long 0xbeB8B969, 0xbeB8B969, 0xbeB8B969, 0xbeB8B969, 0xbeB8B969, 0xbeB8B969, 0xbeB8B969, 0xbeB8B969 /* coeff4 */ + .long 0x3eF637C0, 0x3eF637C0, 0x3eF637C0, 0x3eF637C0, 0x3eF637C0, 0x3eF637C0, 0x3eF637C0, 0x3eF637C0 /* coeff3 */ + .long 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B, 0xbf38AA2B /* coeff2 */ + .long 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B, 0x3fB8AA3B /* coeff1 */ + .align 32 + .type __svml_slog2_data_internal,@object + .size __svml_slog2_data_internal,.-__svml_slog2_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_log22_core.S b/sysdeps/x86_64/fpu/svml_d_log22_core.S new file mode 100644 index 0000000000..f181a62c7d --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log22_core.S @@ -0,0 +1,29 @@ +/* Function log2 vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVbN2v_log2)
+WRAPPER_IMPL_SSE2 log2
+END (_ZGVbN2v_log2)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVbN2v_log2)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_log24_core.S b/sysdeps/x86_64/fpu/svml_d_log24_core.S
new file mode 100644
index 0000000000..b0a5aa9532
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_log24_core.S
@@ -0,0 +1,29 @@
+/* Function log2 vectorized with AVX2, wrapper version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVdN4v_log2)
+WRAPPER_IMPL_AVX _ZGVbN2v_log2
+END (_ZGVdN4v_log2)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVdN4v_log2)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_log24_core_avx.S b/sysdeps/x86_64/fpu/svml_d_log24_core_avx.S
new file mode 100644
index 0000000000..9a56cfed61
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_log24_core_avx.S
@@ -0,0 +1,25 @@
+/* Function log2 vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVcN4v_log2)
+WRAPPER_IMPL_AVX _ZGVbN2v_log2
+END (_ZGVcN4v_log2)
diff --git a/sysdeps/x86_64/fpu/svml_d_log28_core.S b/sysdeps/x86_64/fpu/svml_d_log28_core.S
new file mode 100644
index 0000000000..443cbfd578
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_log28_core.S
@@ -0,0 +1,25 @@
+/* Function log2 vectorized with AVX-512, wrapper to AVX2.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVeN8v_log2)
+WRAPPER_IMPL_AVX512 _ZGVdN4v_log2
+END (_ZGVeN8v_log2)
diff --git a/sysdeps/x86_64/fpu/svml_s_log2f16_core.S b/sysdeps/x86_64/fpu/svml_s_log2f16_core.S
new file mode 100644
index 0000000000..6cf265fd33
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_log2f16_core.S
@@ -0,0 +1,25 @@
+/* Function log2f vectorized with AVX-512. Wrapper to AVX2 version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVeN16v_log2f)
+WRAPPER_IMPL_AVX512 _ZGVdN8v_log2f
+END (_ZGVeN16v_log2f)
diff --git a/sysdeps/x86_64/fpu/svml_s_log2f4_core.S b/sysdeps/x86_64/fpu/svml_s_log2f4_core.S
new file mode 100644
index 0000000000..024ba9b8c5
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_log2f4_core.S
@@ -0,0 +1,29 @@
+/* Function log2f vectorized with SSE2, wrapper version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVbN4v_log2f)
+WRAPPER_IMPL_SSE2 log2f
+END (_ZGVbN4v_log2f)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVbN4v_log2f)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_log2f8_core.S b/sysdeps/x86_64/fpu/svml_s_log2f8_core.S
new file mode 100644
index 0000000000..5705590563
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_log2f8_core.S
@@ -0,0 +1,29 @@
+/* Function log2f vectorized with AVX2, wrapper version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVdN8v_log2f)
+WRAPPER_IMPL_AVX _ZGVbN4v_log2f
+END (_ZGVdN8v_log2f)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVdN8v_log2f)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_log2f8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_log2f8_core_avx.S
new file mode 100644
index 0000000000..38602c475e
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_log2f8_core_avx.S
@@ -0,0 +1,25 @@
+/* Function log2f vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVcN8v_log2f)
+WRAPPER_IMPL_AVX _ZGVbN4v_log2f
+END (_ZGVcN8v_log2f)
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx.c
new file mode 100644
index 0000000000..95d8e4bbd8
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-log2.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx2.c
new file mode 100644
index 0000000000..95d8e4bbd8
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx2.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-log2.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx512f.c
new file mode 100644
index 0000000000..95d8e4bbd8
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-log2-avx512f.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-log2.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log2.c b/sysdeps/x86_64/fpu/test-double-libmvec-log2.c
new file mode 100644
index 0000000000..326b6f1171
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-log2.c
@@ -0,0 +1,3 @@
+#define LIBMVEC_TYPE double
+#define LIBMVEC_FUNC log2
+#include "test-vector-abi-arg1.h"
diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
index 3dce136dfc..08c91ff634 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c
@@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVbN2v_sinh)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVbN2v_cbrt)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVbN2vv_atan2)
 VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVbN2v_log10)
+VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVbN2v_log2)
 
 #define VEC_INT_TYPE __m128i
 
diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
index 1852625897..a2fb0de309 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c
@@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVdN4v_sinh)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVdN4v_cbrt)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVdN4vv_atan2)
 VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVdN4v_log10)
+VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVdN4v_log2)
 
 #ifndef __ILP32__
 # define VEC_INT_TYPE __m256i
diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c
index cf9ea35ffe..dc65a4ee25 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c
@@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVcN4v_sinh)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVcN4v_cbrt)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVcN4vv_atan2)
 VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVcN4v_log10)
+VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVcN4v_log2)
 
 #define VEC_INT_TYPE __m128i
 
diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c
index b6457ea032..253ee8c906 100644
--- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c
@@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinh), _ZGVeN8v_sinh)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVeN8v_cbrt)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVeN8vv_atan2)
 VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVeN8v_log10)
+VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVeN8v_log2)
 
 #ifndef __ILP32__
 # define VEC_INT_TYPE __m512i
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx.c
new file mode 100644
index 0000000000..c88b3fc5a9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx.c
@@ -0,0 +1 @@
+#include "test-float-libmvec-log2f.c"
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx2.c
new file mode 100644
index 0000000000..c88b3fc5a9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx2.c
@@ -0,0 +1 @@
+#include "test-float-libmvec-log2f.c"
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx512f.c
new file mode 100644
index 0000000000..c88b3fc5a9
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-log2f-avx512f.c
@@ -0,0 +1 @@
+#include "test-float-libmvec-log2f.c"
diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log2f.c b/sysdeps/x86_64/fpu/test-float-libmvec-log2f.c
new file mode 100644
index 0000000000..afba03d1e2
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-float-libmvec-log2f.c
@@ -0,0 +1,3 @@
+#define LIBMVEC_TYPE float
+#define LIBMVEC_FUNC log2f
+#include "test-vector-abi-arg1.h"
diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c
index 272e754e1b..1c7db5146c 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c
@@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVeN16v_sinhf)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVeN16v_cbrtf)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVeN16vv_atan2f)
 VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVeN16v_log10f)
+VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVeN16v_log2f)
 
 #define VEC_INT_TYPE __m512i
 
diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c
index b892258b99..8ec51603b3 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c
@@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVbN4v_sinhf)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVbN4v_cbrtf)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVbN4vv_atan2f)
 VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVbN4v_log10f)
+VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVbN4v_log2f)
 
 #define VEC_INT_TYPE __m128i
diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c
index 1c6ead71e1..1cb4553c7a 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c
@@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVdN8v_sinhf)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVdN8v_cbrtf)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVdN8vv_atan2f)
 VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVdN8v_log10f)
+VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVdN8v_log2f)
 
 /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf.  */
 #undef VECTOR_WRAPPER_fFF
diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c
index 71f5d8d7b6..6ecc1792bb 100644
--- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c
+++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c
@@ -39,6 +39,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (sinhf), _ZGVcN8v_sinhf)
 VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVcN8v_cbrtf)
 VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVcN8vv_atan2f)
 VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVcN8v_log10f)
+VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVcN8v_log2f)
 
 #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:25 2021
From: Sunil Pandey
Reply-To: Sunil K Pandey
To: libc-alpha@sourceware.org
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Subject: [PATCH v4 13/18] x86-64: Add vector log1p/log1pf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:25 -0800
Message-Id: <20211228201130.737370-14-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>

Implement vectorized log1p/log1pf for libmvec, with SSE, AVX, AVX2 and
AVX512 versions, as required by the vector ABI.  Also add accuracy and
ABI tests for vector log1p/log1pf, with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_log1p2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_log1p2_core.c | 27 + .../fpu/multiarch/svml_d_log1p2_core_sse4.S | 1395 +++++++++++++++++ .../fpu/multiarch/svml_d_log1p4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_log1p4_core.c | 27 + .../fpu/multiarch/svml_d_log1p4_core_avx2.S | 1380 ++++++++++++++++ .../fpu/multiarch/svml_d_log1p8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_log1p8_core.c | 27 + .../fpu/multiarch/svml_d_log1p8_core_avx512.S | 317 ++++ .../fpu/multiarch/svml_s_log1pf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_log1pf16_core.c | 28 + .../multiarch/svml_s_log1pf16_core_avx512.S | 271 ++++ .../fpu/multiarch/svml_s_log1pf4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_log1pf4_core.c | 28 + .../fpu/multiarch/svml_s_log1pf4_core_sse4.S | 252 +++ .../fpu/multiarch/svml_s_log1pf8_core-sse.S | 20 + .../fpu/multiarch/svml_s_log1pf8_core.c | 28 + .../fpu/multiarch/svml_s_log1pf8_core_avx2.S | 254 +++ sysdeps/x86_64/fpu/svml_d_log1p2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_log1p4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_log1p4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_log1p8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_log1pf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_log1pf4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_log1pf8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_log1pf8_core_avx.S | 25 + .../fpu/test-double-libmvec-log1p-avx.c | 1 + .../fpu/test-double-libmvec-log1p-avx2.c | 1 + .../fpu/test-double-libmvec-log1p-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-log1p.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-log1pf-avx.c | 1 + .../fpu/test-float-libmvec-log1pf-avx2.c | 1 + .../fpu/test-float-libmvec-log1pf-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-log1pf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 4441 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log1p2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log1p4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_log1p4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_log1p8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log1pf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log1pf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log1pf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_log1pf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-log1p.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-log1pf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 73252615ca..845246fab9 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -241,4 +241,15 @@ #define __DECL_SIMD_log2f32x #define __DECL_SIMD_log2f64x #define __DECL_SIMD_log2f128x + +#define __DECL_SIMD_log1p +#define __DECL_SIMD_log1pf +#define __DECL_SIMD_log1pl +#define __DECL_SIMD_log1pf16 +#define __DECL_SIMD_log1pf32 +#define __DECL_SIMD_log1pf64 +#define __DECL_SIMD_log1pf128 +#define __DECL_SIMD_log1pf32x +#define __DECL_SIMD_log1pf64x +#define __DECL_SIMD_log1pf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index bfe52a4666..aa4bc61aa4 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -119,7 +119,7 @@ __MATHCALL_VEC (exp10,, (_Mdouble_ __x)); __MATHCALL_VEC (expm1,, (_Mdouble_ __x)); /* Return log(1 + X). */ -__MATHCALL (log1p,, (_Mdouble_ __x)); +__MATHCALL_VEC (log1p,, (_Mdouble_ __x)); /* Return the base 2 signed integral exponent of X. 
*/ __MATHCALL (logb,, (_Mdouble_ __x)); diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index fa8b016c5d..68b940606a 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -55,6 +55,7 @@ GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2v_expm1 F GLIBC_2.35 _ZGVbN2v_log10 F +GLIBC_2.35 _ZGVbN2v_log1p F GLIBC_2.35 _ZGVbN2v_log2 F GLIBC_2.35 _ZGVbN2v_sinh F GLIBC_2.35 _ZGVbN2vv_atan2 F @@ -68,6 +69,7 @@ GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4v_expm1f F GLIBC_2.35 _ZGVbN4v_log10f F +GLIBC_2.35 _ZGVbN4v_log1pf F GLIBC_2.35 _ZGVbN4v_log2f F GLIBC_2.35 _ZGVbN4v_sinhf F GLIBC_2.35 _ZGVbN4vv_atan2f F @@ -81,6 +83,7 @@ GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4v_expm1 F GLIBC_2.35 _ZGVcN4v_log10 F +GLIBC_2.35 _ZGVcN4v_log1p F GLIBC_2.35 _ZGVcN4v_log2 F GLIBC_2.35 _ZGVcN4v_sinh F GLIBC_2.35 _ZGVcN4vv_atan2 F @@ -94,6 +97,7 @@ GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8v_expm1f F GLIBC_2.35 _ZGVcN8v_log10f F +GLIBC_2.35 _ZGVcN8v_log1pf F GLIBC_2.35 _ZGVcN8v_log2f F GLIBC_2.35 _ZGVcN8v_sinhf F GLIBC_2.35 _ZGVcN8vv_atan2f F @@ -107,6 +111,7 @@ GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4v_expm1 F GLIBC_2.35 _ZGVdN4v_log10 F +GLIBC_2.35 _ZGVdN4v_log1p F GLIBC_2.35 _ZGVdN4v_log2 F GLIBC_2.35 _ZGVdN4v_sinh F GLIBC_2.35 _ZGVdN4vv_atan2 F @@ -120,6 +125,7 @@ GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8v_expm1f F GLIBC_2.35 _ZGVdN8v_log10f F +GLIBC_2.35 _ZGVdN8v_log1pf F GLIBC_2.35 _ZGVdN8v_log2f F GLIBC_2.35 _ZGVdN8v_sinhf F GLIBC_2.35 _ZGVdN8vv_atan2f F @@ -133,6 +139,7 @@ GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16v_expm1f F GLIBC_2.35 _ZGVeN16v_log10f F +GLIBC_2.35 _ZGVeN16v_log1pf F GLIBC_2.35 _ZGVeN16v_log2f F GLIBC_2.35 _ZGVeN16v_sinhf F 
GLIBC_2.35 _ZGVeN16vv_atan2f F @@ -146,6 +153,7 @@ GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8v_expm1 F GLIBC_2.35 _ZGVeN8v_log10 F +GLIBC_2.35 _ZGVeN8v_log1p F GLIBC_2.35 _ZGVeN8v_log2 F GLIBC_2.35 _ZGVeN8v_sinh F GLIBC_2.35 _ZGVeN8vv_atan2 F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 59d284a10a..14c9db3bb3 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -110,6 +110,10 @@ # define __DECL_SIMD_log2 __DECL_SIMD_x86_64 # undef __DECL_SIMD_log2f # define __DECL_SIMD_log2f __DECL_SIMD_x86_64 +# undef __DECL_SIMD_log1p +# define __DECL_SIMD_log1p __DECL_SIMD_x86_64 +# undef __DECL_SIMD_log1pf +# define __DECL_SIMD_log1pf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index a2ca9a203f..3dca196432 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -54,6 +54,8 @@ !GCC$ builtin (log10f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (log2) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (log2f) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (log1p) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (log1pf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -93,3 +95,5 @@ !GCC$ builtin (log10f) attributes simd (notinbranch) if('x32') !GCC$ builtin (log2) attributes simd (notinbranch) if('x32') !GCC$ builtin (log2f) attributes simd (notinbranch) if('x32') +!GCC$ builtin (log1p) attributes simd (notinbranch) if('x32') +!GCC$ builtin (log1pf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 8d6d0915af..378cb06d37 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ 
b/sysdeps/x86_64/fpu/Makeconfig @@ -36,6 +36,7 @@ libmvec-funcs = \ hypot \ log \ log10 \ + log1p \ log2 \ pow \ sin \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 1b48c2d642..155fb115f3 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -23,6 +23,7 @@ libmvec { _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; _ZGVbN2v_log10; _ZGVcN4v_log10; _ZGVdN4v_log10; _ZGVeN8v_log10; + _ZGVbN2v_log1p; _ZGVcN4v_log1p; _ZGVdN4v_log1p; _ZGVeN8v_log1p; _ZGVbN2v_log2; _ZGVcN4v_log2; _ZGVdN4v_log2; _ZGVeN8v_log2; _ZGVbN2v_sinh; _ZGVcN4v_sinh; _ZGVdN4v_sinh; _ZGVeN8v_sinh; _ZGVbN2vv_atan2; _ZGVcN4vv_atan2; _ZGVdN4vv_atan2; _ZGVeN8vv_atan2; @@ -36,6 +37,7 @@ libmvec { _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; _ZGVbN4v_log10f; _ZGVcN8v_log10f; _ZGVdN8v_log10f; _ZGVeN16v_log10f; + _ZGVbN4v_log1pf; _ZGVcN8v_log1pf; _ZGVdN8v_log1pf; _ZGVeN16v_log1pf; _ZGVbN4v_log2f; _ZGVcN8v_log2f; _ZGVdN8v_log2f; _ZGVeN16v_log2f; _ZGVbN4v_sinhf; _ZGVcN8v_sinhf; _ZGVdN8v_sinhf; _ZGVeN16v_sinhf; _ZGVbN4vv_atan2f; _ZGVcN8vv_atan2f; _ZGVdN8vv_atan2f; _ZGVeN16vv_atan2f; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 3b7f3cee6f..a2b15a795b 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1685,6 +1685,26 @@ float: 2 float128: 2 ldouble: 3 +Function: "log1p_vlen16": +float: 2 + +Function: "log1p_vlen2": +double: 1 + +Function: "log1p_vlen4": +double: 1 +float: 2 + +Function: "log1p_vlen4_avx2": +double: 1 + +Function: "log1p_vlen8": +double: 1 +float: 2 + +Function: "log1p_vlen8_avx2": +float: 2 + Function: "log2": double: 2 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core-sse2.S new file mode 100644 index 0000000000..8004088346 
--- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized log1p, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_log1p _ZGVbN2v_log1p_sse2 +#include "../svml_d_log1p2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core.c new file mode 100644 index 0000000000..35ca620aba --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log1p, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVbN2v_log1p +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_log1p, __GI__ZGVbN2v_log1p, __redirect__ZGVbN2v_log1p) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core_sse4.S new file mode 100644 index 0000000000..77bc155728 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p2_core_sse4.S @@ -0,0 +1,1395 @@ +/* Function log1p vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * 1+x = 2^k*(xh + xl) is computed in high-low parts; xh in [1,2) + * Get short reciprocal approximation Rcp ~ 1/xh + * R = (Rcp*xh - 1.0) + Rcp*xl + * log1p(x) = k*log(2.0) - log(Rcp) + poly(R) + * log(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog1p_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8208 +#define poly_coeff 12320 +#define ExpMask 12384 +#define Two10 12400 +#define MinLog1p 12416 +#define MaxLog1p 12432 +#define One 12448 +#define SgnMask 12464 +#define XThreshold 12480 +#define XhMask 12496 +#define Threshold 12512 +#define Bias 12528 +#define Bias1 12544 +#define ExpMask0 12560 +#define ExpMask2 12576 +#define L2 12592 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_log1p_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + movaps %xmm0, %xmm7 + +/* SgnMask used by all accuracies */ + movups SgnMask+__svml_dlog1p_data_internal(%rip), %xmm6 + lea -4218864+__svml_dlog1p_data_internal(%rip), %rsi + movaps %xmm6, %xmm8 + movaps %xmm7, %xmm15 + movups One+__svml_dlog1p_data_internal(%rip), %xmm0 + andps %xmm7, %xmm8 + cmpltpd XThreshold+__svml_dlog1p_data_internal(%rip), %xmm8 + cmpnlepd MaxLog1p+__svml_dlog1p_data_internal(%rip), %xmm15 + movaps %xmm0, %xmm4 + +/* compute 1+x as high, low parts */ + movaps %xmm0, %xmm9 + addpd %xmm7, %xmm4 + maxpd %xmm7, %xmm9 + orps XhMask+__svml_dlog1p_data_internal(%rip), %xmm8 + movaps %xmm0, %xmm5 + +/* preserve mantissa, set input exponent to 2^(-10) */ + movups ExpMask+__svml_dlog1p_data_internal(%rip), %xmm3 + andps %xmm8, %xmm4 + andps %xmm4, %xmm3 + +/* check range */ + movaps %xmm7, %xmm8 + orps Two10+__svml_dlog1p_data_internal(%rip), %xmm3 + +/* Compute SignMask for all accuracies, including EP */ + andnps %xmm7, %xmm6 + +/* reciprocal approximation good to at least 11 bits */ + cvtpd2ps %xmm3, %xmm10 + minpd 
%xmm7, %xmm5 + subpd %xmm4, %xmm9 + cmpltpd MinLog1p+__svml_dlog1p_data_internal(%rip), %xmm8 + addpd %xmm9, %xmm5 + movlhps %xmm10, %xmm10 + orps %xmm15, %xmm8 + rcpps %xmm10, %xmm11 + +/* combine and get argument value range mask */ + movmskpd %xmm8, %edx + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + movups .FLT_16(%rip), %xmm13 + +/* exponent of X needed to scale Xl */ + movdqu ExpMask0+__svml_dlog1p_data_internal(%rip), %xmm12 + cvtps2pd %xmm11, %xmm1 + addpd %xmm13, %xmm1 + subpd %xmm13, %xmm1 + +/* 2^ (-10-exp(X) ) */ + movdqu ExpMask2+__svml_dlog1p_data_internal(%rip), %xmm2 + pand %xmm4, %xmm12 + psubq %xmm12, %xmm2 + mulpd %xmm1, %xmm3 + +/* scale DblRcp */ + mulpd %xmm1, %xmm2 + subpd %xmm0, %xmm3 + +/* + * argument reduction + * VQFMS( D, R, X, DblRcp1, One ); + */ + mulpd %xmm2, %xmm5 + addpd %xmm5, %xmm3 + +/* exponent*log(2.0) */ + movups Threshold+__svml_dlog1p_data_internal(%rip), %xmm10 + +/* exponent bits */ + psrlq $20, %xmm4 + pshufd $221, %xmm4, %xmm14 + +/* + * prepare table index + * table lookup + */ + movaps %xmm1, %xmm4 + cmpltpd %xmm1, %xmm10 + +/* biased exponent in DP format */ + cvtdq2pd %xmm14, %xmm0 + +/* polynomial */ + movups poly_coeff+__svml_dlog1p_data_internal(%rip), %xmm1 + movaps %xmm3, %xmm5 + mulpd %xmm3, %xmm1 + mulpd %xmm3, %xmm5 + addpd poly_coeff+16+__svml_dlog1p_data_internal(%rip), %xmm1 + movups poly_coeff+32+__svml_dlog1p_data_internal(%rip), %xmm2 + psrlq $40, %xmm4 + mulpd %xmm3, %xmm2 + mulpd %xmm5, %xmm1 + addpd poly_coeff+48+__svml_dlog1p_data_internal(%rip), %xmm2 + movd %xmm4, %eax + andps Bias+__svml_dlog1p_data_internal(%rip), %xmm10 + addpd %xmm1, %xmm2 + +/* reconstruction */ + mulpd %xmm2, %xmm5 + orps Bias1+__svml_dlog1p_data_internal(%rip), %xmm10 + pshufd $2, %xmm4, %xmm9 + subpd %xmm10, %xmm0 + addpd %xmm5, %xmm3 + movd %xmm9, %ecx + mulpd L2+__svml_dlog1p_data_internal(%rip), %xmm0 + movslq %eax, %rax + movslq %ecx, %rcx + movsd (%rsi,%rax), %xmm11 + movhpd (%rsi,%rcx), 
%xmm11 + addpd %xmm3, %xmm11 + addpd %xmm11, %xmm0 + +/* OR in the Sign of input argument to produce correct log1p(-0) */ + orps %xmm6, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm7 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm7, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go 
to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call log1p@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_log1p_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dlog1p_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(16)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(16)) VUINT32 poly_coeff[4][2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 Two10[2][2]; + __declspec(align(16)) VUINT32 MinLog1p[2][2]; + __declspec(align(16)) VUINT32 MaxLog1p[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 SgnMask[2][2]; + __declspec(align(16)) VUINT32 XThreshold[2][2]; + __declspec(align(16)) VUINT32 XhMask[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 Bias[2][2]; + __declspec(align(16)) VUINT32 Bias1[2][2]; + __declspec(align(16)) VUINT32 ExpMask0[2][2]; + 
__declspec(align(16)) VUINT32 ExpMask2[2][2]; + __declspec(align(16)) VUINT32 L2[2][2]; +} __svml_dlog1p_data_internal; +#endif +__svml_dlog1p_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 
0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + 
.quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + .quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 
0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + 
.quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + .quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 
0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + 
.quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + .quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 
0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + 
.quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + .quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 
0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + 
.quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + .quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 
0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + 
.quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 16 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 
0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 
0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 
0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 
0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 
0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 
0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 
0x3f8010157588de71
+	.quad 0x3f7c189cbb0e27fb
+	.quad 0x3f78121214586b54
+	.quad 0x3f740c8a747878e2
+	.quad 0x3f70080559588b35
+	.quad 0x3f680904828985c0
+	.quad 0x3f60040155d5889e
+	.quad 0x3f50020055655889
+	.quad 0x0000000000000000
+	/*== poly_coeff[4] ==*/
+	.align 16
+	.quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */
+	.quad 0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */
+	.quad 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */
+	.quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */
+	/*== ExpMask ==*/
+	.align 16
+	.quad 0x000fffffffffffff, 0x000fffffffffffff
+	/*== Two10 ==*/
+	.align 16
+	.quad 0x3f50000000000000, 0x3f50000000000000
+	/*== MinLog1p = -1+2^(-53) ==*/
+	.align 16
+	.quad 0xbfefffffffffffff, 0xbfefffffffffffff
+	/*== MaxLog1p ==*/
+	.align 16
+	.quad 0x7f3ffffffffff000, 0x7f3ffffffffff000
+	/*== One ==*/
+	.align 16
+	.quad 0x3ff0000000000000, 0x3ff0000000000000
+	/*== SgnMask ==*/
+	.align 16
+	.quad 0x7fffffffffffffff, 0x7fffffffffffffff
+	/*== XThreshold ==*/
+	.align 16
+	.quad 0x3e00000000000000, 0x3e00000000000000
+	/*== XhMask ==*/
+	.align 16
+	.quad 0xfffffffffffffc00, 0xfffffffffffffc00
+	/*== Threshold ==*/
+	.align 16
+	.quad 0x4086a00000000000, 0x4086a00000000000
+	/*== Bias ==*/
+	.align 16
+	.quad 0x408ff80000000000, 0x408ff80000000000
+	/*== Bias1 ==*/
+	.align 16
+	.quad 0x408ff00000000000, 0x408ff00000000000
+	/*== ExpMask ==*/
+	.align 16
+	.quad 0x7ff0000000000000, 0x7ff0000000000000
+	/*== ExpMask2 ==*/
+	.align 16
+	.quad 0x7f40000000000000, 0x7f40000000000000
+	/*== L2L ==*/
+	.align 16
+	.quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF
+	.align 16
+	.type __svml_dlog1p_data_internal,@object
+	.size __svml_dlog1p_data_internal,.-__svml_dlog1p_data_internal
+	.space 96, 0x00
+	.align 16
+
+.FLT_16:
+	.long 0x00000000,0x43380000,0x00000000,0x43380000
+	.type .FLT_16,@object
+	.size .FLT_16,16
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core-sse.S
new file mode 100644
index 0000000000..ec01af680c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core-sse.S
@@ -0,0 +1,20 @@
+/* SSE version of vectorized log1p, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+#define _ZGVdN4v_log1p _ZGVdN4v_log1p_sse_wrapper
+#include "../svml_d_log1p4_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core.c
new file mode 100644
index 0000000000..808f3224ef
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized log1p, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+#define SYMBOL_NAME _ZGVdN4v_log1p
+#include "ifunc-mathvec-avx2.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVdN4v_log1p, __GI__ZGVdN4v_log1p, __redirect__ZGVdN4v_log1p)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core_avx2.S
new file mode 100644
index 0000000000..5c2c0464fc
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p4_core_avx2.S
@@ -0,0 +1,1380 @@
+/* Function log1p vectorized with AVX2.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   https://www.gnu.org/licenses/.
 */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *    1+x = 2^k*(xh + xl) is computed in high-low parts; xh in [1,2)
+ *    Get short reciprocal approximation Rcp ~ 1/xh
+ *    R = (Rcp*xh - 1.0) + Rcp*xl
+ *    log1p(x) = k*log(2.0) - log(Rcp) + poly(R)
+ *    log(Rcp) is tabulated
+ *
+ *
+ */
+
+/* Offsets for data table __svml_dlog1p_data_internal
+ */
+#define Log_HA_table	0
+#define Log_LA_table	8224
+#define poly_coeff	12352
+#define ExpMask	12480
+#define Two10	12512
+#define MinLog1p	12544
+#define MaxLog1p	12576
+#define One	12608
+#define SgnMask	12640
+#define XThreshold	12672
+#define XhMask	12704
+#define Threshold	12736
+#define Bias	12768
+#define Bias1	12800
+#define ExpMask0	12832
+#define ExpMask2	12864
+#define L2	12896
+
+#include
+
+	.text
+	.section .text.avx2,"ax",@progbits
+ENTRY(_ZGVdN4v_log1p_avx2)
+	pushq	%rbp
+	cfi_def_cfa_offset(16)
+	movq	%rsp, %rbp
+	cfi_def_cfa(6, 16)
+	cfi_offset(6, -16)
+	andq	$-32, %rsp
+	subq	$96, %rsp
+	lea	-4218848+__svml_dlog1p_data_internal(%rip), %r8
+
+/* SgnMask used by all accuracies */
+	vmovupd	SgnMask+__svml_dlog1p_data_internal(%rip), %ymm12
+	vmovupd	One+__svml_dlog1p_data_internal(%rip), %ymm7
+
+/* 2^ (-10-exp(X) ) */
+	vmovupd	ExpMask2+__svml_dlog1p_data_internal(%rip), %ymm3
+	vmovapd	%ymm0, %ymm9
+	vandpd	%ymm12, %ymm9, %ymm10
+	vcmplt_oqpd XThreshold+__svml_dlog1p_data_internal(%rip), %ymm10, %ymm11
+	vaddpd	%ymm7, %ymm9, %ymm13
+
+/* compute 1+x as high, low parts */
+	vmaxpd	%ymm9, %ymm7, %ymm15
+	vminpd	%ymm9, %ymm7, %ymm6
+	vorpd	XhMask+__svml_dlog1p_data_internal(%rip), %ymm11, %ymm14
+	vandpd	%ymm14, %ymm13, %ymm4
+
+/* preserve mantissa, set input exponent to 2^(-10) */
+	vandpd	ExpMask+__svml_dlog1p_data_internal(%rip), %ymm4, %ymm5
+	vorpd	Two10+__svml_dlog1p_data_internal(%rip), %ymm5, %ymm5
+
+/* reciprocal approximation good to at least 11 bits */
+	vcvtpd2ps %ymm5, %xmm2
+	vsubpd	%ymm4, %ymm15, %ymm0
+
+/* check range */
+	vcmplt_oqpd MinLog1p+__svml_dlog1p_data_internal(%rip), %ymm9, %ymm15
+	vrcpps	%xmm2, %xmm1
+	vaddpd	%ymm0, %ymm6, %ymm6
+	vcmpnle_uqpd MaxLog1p+__svml_dlog1p_data_internal(%rip), %ymm9, %ymm0
+	vcvtps2pd %xmm1, %ymm11
+
+/* exponent of X needed to scale Xl */
+	vandps	ExpMask0+__svml_dlog1p_data_internal(%rip), %ymm4, %ymm10
+	vpsubq	%ymm10, %ymm3, %ymm13
+
+/* exponent bits */
+	vpsrlq	$20, %ymm4, %ymm4
+
+/* round reciprocal to nearest integer, will have 1+9 mantissa bits */
+	vroundpd $0, %ymm11, %ymm3
+
+/* scale DblRcp */
+	vmulpd	%ymm13, %ymm3, %ymm2
+
+/* exponent*log(2.0) */
+	vmovupd	Threshold+__svml_dlog1p_data_internal(%rip), %ymm13
+	vfmsub213pd %ymm7, %ymm3, %ymm5
+
+/* Compute SignMask for all accuracies, including EP */
+	vandnpd	%ymm9, %ymm12, %ymm8
+	vorpd	%ymm0, %ymm15, %ymm7
+
+/*
+ * prepare table index
+ * table lookup
+ */
+	vpsrlq	$40, %ymm3, %ymm0
+
+/*
+ * argument reduction
+ * VQFMS( D, R, X, DblRcp1, One );
+ */
+	vfmadd213pd %ymm5, %ymm2, %ymm6
+	vmovupd	poly_coeff+64+__svml_dlog1p_data_internal(%rip), %ymm2
+	vcmplt_oqpd %ymm3, %ymm13, %ymm3
+	vmulpd	%ymm6, %ymm6, %ymm5
+	vfmadd213pd poly_coeff+96+__svml_dlog1p_data_internal(%rip), %ymm6, %ymm2
+
+/* combine and get argument value range mask */
+	vmovmskpd %ymm7, %eax
+	vextractf128 $1, %ymm4, %xmm12
+	vshufps	$221, %xmm12, %xmm4, %xmm14
+
+/* biased exponent in DP format */
+	vcvtdq2pd %xmm14, %ymm1
+	vandpd	Bias+__svml_dlog1p_data_internal(%rip), %ymm3, %ymm14
+	vorpd	Bias1+__svml_dlog1p_data_internal(%rip), %ymm14, %ymm15
+	vsubpd	%ymm15, %ymm1, %ymm1
+	vmulpd	L2+__svml_dlog1p_data_internal(%rip), %ymm1, %ymm3
+
+/* polynomial */
+	vmovupd	poly_coeff+__svml_dlog1p_data_internal(%rip), %ymm1
+	vfmadd213pd poly_coeff+32+__svml_dlog1p_data_internal(%rip), %ymm6, %ymm1
+	vfmadd213pd %ymm2, %ymm5, %ymm1
+
+/* reconstruction */
+	vfmadd213pd %ymm6, %ymm5, %ymm1
+	vextractf128 $1, %ymm0, %xmm10
+	vmovd	%xmm0, %edx
+	vmovd	%xmm10, %esi
+	movslq	%edx, %rdx
+	vpextrd	$2, %xmm0, %ecx
+	movslq	%esi, %rsi
+	vpextrd	$2, %xmm10, %edi
+	movslq	%ecx, %rcx
+	movslq	%edi, %rdi
+	vmovsd	(%r8,%rdx), %xmm4
+	vmovsd	(%r8,%rsi), %xmm11
+	vmovhpd	(%r8,%rcx), %xmm4, %xmm7
+	vmovhpd	(%r8,%rdi), %xmm11, %xmm12
+	vinsertf128 $1, %xmm12, %ymm7, %ymm0
+	vaddpd	%ymm1, %ymm0, %ymm6
+	vaddpd	%ymm6, %ymm3, %ymm0
+
+/* OR in the Sign of input argument to produce correct log1p(-0) */
+	vorpd	%ymm8, %ymm0, %ymm0
+	testl	%eax, %eax
+
+/* Go to special inputs processing branch */
+	jne	L(SPECIAL_VALUES_BRANCH)
+	# LOE rbx r12 r13 r14 r15 eax ymm0 ymm9
+
+/* Restore registers
+ * and exit the function
+ */
+
+L(EXIT):
+	movq	%rbp, %rsp
+	popq	%rbp
+	cfi_def_cfa(7, 8)
+	cfi_restore(6)
+	ret
+	cfi_def_cfa(6, 16)
+	cfi_offset(6, -16)
+
+/* Branch to process
+ * special inputs
+ */
+
+L(SPECIAL_VALUES_BRANCH):
+	vmovupd	%ymm9, 32(%rsp)
+	vmovupd	%ymm0, 64(%rsp)
+	# LOE rbx r12 r13 r14 r15 eax ymm0
+
+	xorl	%edx, %edx
+	# LOE rbx r12 r13 r14 r15 eax edx
+
+	vzeroupper
+	movq	%r12, 16(%rsp)
+	/* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22
+	movl	%edx, %r12d
+	movq	%r13, 8(%rsp)
+	/* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22
+	movl	%eax, %r13d
+	movq	%r14, (%rsp)
+	/* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22
+	# LOE rbx r15 r12d r13d
+
+/* Range mask
+ * bits check
+ */
+
+L(RANGEMASK_CHECK):
+	btl	%r12d, %r13d
+
+/* Call scalar math function */
+	jc	L(SCALAR_MATH_CALL)
+	# LOE rbx r15 r12d r13d
+
+/* Special inputs
+ * processing loop
+ */
+
+L(SPECIAL_VALUES_LOOP):
+	incl	%r12d
+	cmpl	$4, %r12d
+
+/* Check bits in range mask */
+	jl	L(RANGEMASK_CHECK)
+	# LOE rbx r15 r12d r13d
+
+	movq	16(%rsp), %r12
+	cfi_restore(12)
+	movq	8(%rsp), %r13
+	cfi_restore(13)
+	movq	(%rsp), %r14
+	cfi_restore(14)
+	vmovupd	64(%rsp), %ymm0
+
+/* Go to exit */
+	jmp	L(EXIT)
+	/* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22
+	/* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22
+	/* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */
+	.cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22
+	# LOE rbx r12 r13 r14 r15 ymm0
+
+/* Scalar math function call
+ * to process special input
+ */
+
+L(SCALAR_MATH_CALL):
+	movl	%r12d, %r14d
+	movsd	32(%rsp,%r14,8), %xmm0
+	call	log1p@PLT
+	# LOE rbx r14 r15 r12d r13d xmm0
+
+	movsd	%xmm0, 64(%rsp,%r14,8)
+
+/* Process special inputs in loop */
+	jmp	L(SPECIAL_VALUES_LOOP)
+	# LOE rbx r15 r12d r13d
+END(_ZGVdN4v_log1p_avx2)
+
+	.section .rodata, "a"
+	.align 32
+
+#ifdef __svml_dlog1p_data_internal_typedef
+typedef unsigned int VUINT32;
+typedef struct {
+	__declspec(align(32)) VUINT32 Log_HA_table[(1<<10)+2][2];
+	__declspec(align(32)) VUINT32 Log_LA_table[(1<<9)+1][2];
+	__declspec(align(32)) VUINT32 poly_coeff[4][4][2];
+	__declspec(align(32)) VUINT32 ExpMask[4][2];
+	__declspec(align(32)) VUINT32 Two10[4][2];
+	__declspec(align(32)) VUINT32 MinLog1p[4][2];
+	__declspec(align(32)) VUINT32 MaxLog1p[4][2];
+	__declspec(align(32)) VUINT32 One[4][2];
+	__declspec(align(32)) VUINT32 SgnMask[4][2];
+	__declspec(align(32)) VUINT32 XThreshold[4][2];
+ __declspec(align(32)) VUINT32 XhMask[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 Bias[4][2]; + __declspec(align(32)) VUINT32 Bias1[4][2]; + __declspec(align(32)) VUINT32 ExpMask0[4][2]; + __declspec(align(32)) VUINT32 ExpMask2[4][2]; + __declspec(align(32)) VUINT32 L2[4][2]; +} __svml_dlog1p_data_internal; +#endif +__svml_dlog1p_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 
0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + 
.quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + .quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 
0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + 
.quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + .quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 
0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + 
.quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + .quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 
0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + 
.quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + .quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 
0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + 
.quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + .quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 
0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + 
.quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 32 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 
0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 
0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 
0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 
0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 
0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 
0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 
0x3f912487a5507f70
+	.quad 0x3f90205658935847
+	.quad 0x3f8e38ce3033310c
+	.quad 0x3f8c317384c75f06
+	.quad 0x3f8a2a9c6c170462
+	.quad 0x3f882448a388a2aa
+	.quad 0x3f861e77e8b53fc6
+	.quad 0x3f841929f96832f0
+	.quad 0x3f82145e939ef1e9
+	.quad 0x3f8010157588de71
+	.quad 0x3f7c189cbb0e27fb
+	.quad 0x3f78121214586b54
+	.quad 0x3f740c8a747878e2
+	.quad 0x3f70080559588b35
+	.quad 0x3f680904828985c0
+	.quad 0x3f60040155d5889e
+	.quad 0x3f50020055655889
+	.quad 0x0000000000000000
+	/*== poly_coeff[4] ==*/
+	.align 32
+	.quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */
+	.quad 0xbfd0000148058EE1, 0xbfd0000148058EE1, 0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */
+	.quad 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */
+	.quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */
+	/*== ExpMask ==*/
+	.align 32
+	.quad 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff
+	/*== Two10 ==*/
+	.align 32
+	.quad 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000
+	/*== MinLog1p = -1+2^(-53) ==*/
+	.align 32
+	.quad 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff
+	/*== MaxLog1p ==*/
+	.align 32
+	.quad 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000
+	/*== One ==*/
+	.align 32
+	.quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000
+	/*== SgnMask ==*/
+	.align 32
+	.quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff
+	/*== XThreshold ==*/
+	.align 32
+	.quad 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000
+	/*== XhMask ==*/
+	.align 32
+	.quad 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00
+	/*== Threshold ==*/
+	.align 32
+	.quad 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000
+ /*== Bias ==*/ + .align 32 + .quad 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 32 + .quad 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000 + /*== ExpMask ==*/ + .align 32 + .quad 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000 + /*== ExpMask2 ==*/ + .align 32 + .quad 0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000 + /*== L2L ==*/ + .align 32 + .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF + .align 32 + .type __svml_dlog1p_data_internal,@object + .size __svml_dlog1p_data_internal,.-__svml_dlog1p_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core-avx2.S new file mode 100644 index 0000000000..ca174a5f52 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized log1p, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define _ZGVeN8v_log1p _ZGVeN8v_log1p_avx2_wrapper +#include "../svml_d_log1p8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core.c new file mode 100644 index 0000000000..0aa35ec8c5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized log1p, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVeN8v_log1p +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_log1p, __GI__ZGVeN8v_log1p, __redirect__ZGVeN8v_log1p) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core_avx512.S new file mode 100644 index 0000000000..5e38ff8d39 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_log1p8_core_avx512.S @@ -0,0 +1,317 @@ +/* Function log1p vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * 1+x = 2^k*(xh + xl) is computed in high-low parts; xh in [1,2) + * Get short reciprocal approximation Rcp ~ 1/xh + * R = (Rcp*xh - 1.0) + Rcp*xl + * log1p(x) = k*log(2.0) - log(Rcp) + poly(R) + * log(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_dlog1p_data_internal_avx512 + */ +#define Log_tbl 0 +#define One 128 +#define SgnMask 192 +#define C075 256 +#define poly_coeff9 320 +#define poly_coeff8 384 +#define poly_coeff7 448 +#define poly_coeff6 512 +#define poly_coeff5 576 +#define poly_coeff4 640 +#define poly_coeff3 704 +#define poly_coeff2 768 +#define L2 832 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_log1p_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups One+__svml_dlog1p_data_internal_avx512(%rip), %zmm7 + vmovups SgnMask+__svml_dlog1p_data_internal_avx512(%rip), %zmm14 + vmovaps %zmm0, %zmm9 + vaddpd {rn-sae}, %zmm9, %zmm7, %zmm11 + vandpd %zmm14, %zmm9, %zmm8 + +/* compute 1+x as high, low parts */ + vmaxpd {sae}, %zmm9, %zmm7, %zmm10 + vminpd {sae}, %zmm9, %zmm7, %zmm12 + +/* GetMant(x), normalized to [1,2) for x>=0, NaN for x<0 */ + vgetmantpd $8, {sae}, %zmm11, %zmm6 + +/* GetExp(x) */ + vgetexppd 
{sae}, %zmm11, %zmm5 + vsubpd {rn-sae}, %zmm10, %zmm11, %zmm13 + +/* DblRcp ~ 1/Mantissa */ + vrcp14pd %zmm6, %zmm15 + +/* Start polynomial evaluation */ + vmovups poly_coeff9+__svml_dlog1p_data_internal_avx512(%rip), %zmm10 + vmovups poly_coeff7+__svml_dlog1p_data_internal_avx512(%rip), %zmm11 + +/* Xl */ + vsubpd {rn-sae}, %zmm13, %zmm12, %zmm2 + vxorpd %zmm14, %zmm5, %zmm3 + +/* round DblRcp to 4 fractional bits (RN mode, no Precision exception) */ + vrndscalepd $88, {sae}, %zmm15, %zmm4 + vmovups poly_coeff5+__svml_dlog1p_data_internal_avx512(%rip), %zmm12 + vmovups poly_coeff6+__svml_dlog1p_data_internal_avx512(%rip), %zmm14 + vmovups poly_coeff3+__svml_dlog1p_data_internal_avx512(%rip), %zmm13 + +/* Xl*2^(-Expon) */ + vscalefpd {rn-sae}, %zmm3, %zmm2, %zmm1 + +/* Reduced argument: R = DblRcp*(Mantissa+Xl) - 1 */ + vfmsub213pd {rn-sae}, %zmm7, %zmm4, %zmm6 + vmovups __svml_dlog1p_data_internal_avx512(%rip), %zmm3 + +/* + * Table lookup + * Prepare exponent correction: DblRcp<0.75? + */ + vmovups C075+__svml_dlog1p_data_internal_avx512(%rip), %zmm2 + +/* Prepare table index */ + vpsrlq $48, %zmm4, %zmm0 + vfmadd231pd {rn-sae}, %zmm4, %zmm1, %zmm6 + vmovups poly_coeff8+__svml_dlog1p_data_internal_avx512(%rip), %zmm1 + vcmppd $17, {sae}, %zmm2, %zmm4, %k1 + vcmppd $4, {sae}, %zmm6, %zmm6, %k0 + vfmadd231pd {rn-sae}, %zmm6, %zmm10, %zmm1 + vmovups poly_coeff4+__svml_dlog1p_data_internal_avx512(%rip), %zmm10 + vfmadd231pd {rn-sae}, %zmm6, %zmm11, %zmm14 + vmovups L2+__svml_dlog1p_data_internal_avx512(%rip), %zmm4 + vpermt2pd Log_tbl+64+__svml_dlog1p_data_internal_avx512(%rip), %zmm0, %zmm3 + +/* add 1 to Expon if DblRcp<0.75 */ + vaddpd {rn-sae}, %zmm7, %zmm5, %zmm5{%k1} + +/* R^2 */ + vmulpd {rn-sae}, %zmm6, %zmm6, %zmm0 + vfmadd231pd {rn-sae}, %zmm6, %zmm12, %zmm10 + vmovups poly_coeff2+__svml_dlog1p_data_internal_avx512(%rip), %zmm12 + vmulpd {rn-sae}, %zmm0, %zmm0, %zmm15 + vfmadd231pd {rn-sae}, %zmm6, %zmm13, %zmm12 + vfmadd213pd {rn-sae}, %zmm14, %zmm0, %zmm1 
+ kmovw %k0, %edx + vfmadd213pd {rn-sae}, %zmm12, %zmm0, %zmm10 + +/* polynomial */ + vfmadd213pd {rn-sae}, %zmm10, %zmm15, %zmm1 + vfmadd213pd {rn-sae}, %zmm6, %zmm0, %zmm1 + vaddpd {rn-sae}, %zmm1, %zmm3, %zmm6 + vfmadd213pd {rn-sae}, %zmm6, %zmm4, %zmm5 + vorpd %zmm8, %zmm5, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm9 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm9, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ 
+ jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call log1p@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_log1p_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dlog1p_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl[16][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 SgnMask[8][2]; + __declspec(align(64)) VUINT32 C075[8][2]; + __declspec(align(64)) VUINT32 poly_coeff9[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 
poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 L2[8][2]; + } __svml_dlog1p_data_internal_avx512; +#endif +__svml_dlog1p_data_internal_avx512: + /*== Log_tbl ==*/ + .quad 0x0000000000000000 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfd1675cababa60e + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd739d7f6bbd007 + .quad 0x3fd269621134db92 + .quad 0x3fcf991c6cb3b379 + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc1178e8227e47c + .quad 0x3fb9335e5d594989 + .quad 0x3fb08598b59e3a07 + .quad 0x3fa0415d89e74444 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== SgnMask ==*/ + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 + /*== C075 0.75 ==*/ + .align 64 + .quad 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000, 0x3fe8000000000000 + /*== poly_coeff9 ==*/ + .align 64 + .quad 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70, 0x3fbC81CD309D7C70 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0xbfc007357E93AF62, 0xbfc007357E93AF62, 0xbfc007357E93AF62, 0xbfc007357E93AF62, 0xbfc007357E93AF62, 0xbfc007357E93AF62, 0xbfc007357E93AF62, 0xbfc007357E93AF62 + /*== poly_coeff7 ==*/ + .align 64 + .quad 0x3fc249229CEE81EF, 0x3fc249229CEE81EF, 0x3fc249229CEE81EF, 0x3fc249229CEE81EF, 0x3fc249229CEE81EF, 0x3fc249229CEE81EF, 0x3fc249229CEE81EF, 0x3fc249229CEE81EF + /*== poly_coeff6 ==*/ + .align 64 + .quad 0xbfc55553FB28DB06, 0xbfc55553FB28DB06, 0xbfc55553FB28DB06, 0xbfc55553FB28DB06, 0xbfc55553FB28DB06, 
0xbfc55553FB28DB06, 0xbfc55553FB28DB06, 0xbfc55553FB28DB06 + /*== poly_coeff5 ==*/ + .align 64 + .quad 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C, 0x3fc9999999CC9F5C + /*== poly_coeff4 ==*/ + .align 64 + .quad 0xbfd00000000C05BD, 0xbfd00000000C05BD, 0xbfd00000000C05BD, 0xbfd00000000C05BD, 0xbfd00000000C05BD, 0xbfd00000000C05BD, 0xbfd00000000C05BD, 0xbfd00000000C05BD + /*== poly_coeff3 ==*/ + .align 64 + .quad 0x3fd5555555555466, 0x3fd5555555555466, 0x3fd5555555555466, 0x3fd5555555555466, 0x3fd5555555555466, 0x3fd5555555555466, 0x3fd5555555555466, 0x3fd5555555555466 + /*== poly_coeff2 ==*/ + .align 64 + .quad 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6, 0xbfdFFFFFFFFFFFC6 + /*== L2 = log(2) ==*/ + .align 64 + .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF + .align 64 + .type __svml_dlog1p_data_internal_avx512,@object + .size __svml_dlog1p_data_internal_avx512,.-__svml_dlog1p_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core-avx2.S new file mode 100644 index 0000000000..3c0a0a01a2 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized log1pf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVeN16v_log1pf _ZGVeN16v_log1pf_avx2_wrapper +#include "../svml_s_log1pf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core.c new file mode 100644 index 0000000000..9af1320547 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log1pf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVeN16v_log1pf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_log1pf, __GI__ZGVeN16v_log1pf, + __redirect__ZGVeN16v_log1pf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core_avx512.S new file mode 100644 index 0000000000..78b2fe417f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf16_core_avx512.S @@ -0,0 +1,271 @@ +/* Function log1pf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * 1+x = 2^k*(xh + xl) is computed in high-low parts; xh in [1,2) + * Get short reciprocal approximation Rcp ~ 1/xh + * R = (Rcp*xh - 1.0) + Rcp*xl + * log1p(x) = k*log(2.0) - log(Rcp) + poly(R) + * log(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog1p_data_internal + */ +#define SgnMask 0 +#define sOne 64 +#define sPoly_1 128 +#define sPoly_2 192 +#define sPoly_3 256 +#define sPoly_4 320 +#define sPoly_5 384 +#define sPoly_6 448 +#define sPoly_7 512 +#define sPoly_8 576 +#define iHiDelta 640 +#define iLoRange 704 +#define iBrkValue 768 +#define iOffExpoMask 832 +#define sLn2 896 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_log1pf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups sOne+__svml_slog1p_data_internal(%rip), %zmm2 + +/* reduction: compute r,n */ + vmovups iBrkValue+__svml_slog1p_data_internal(%rip), %zmm12 + vmovups SgnMask+__svml_slog1p_data_internal(%rip), %zmm4 + vmovaps %zmm0, %zmm3 + +/* compute 1+x as high, low parts */ + vmaxps {sae}, %zmm3, %zmm2, %zmm5 + vminps {sae}, %zmm3, %zmm2, %zmm7 + vandnps %zmm3, %zmm4, %zmm1 + vpternlogd $255, %zmm4, %zmm4, %zmm4 + vaddps {rn-sae}, %zmm7, %zmm5, %zmm9 + vpsubd %zmm12, %zmm9, %zmm10 + vsubps {rn-sae}, %zmm9, %zmm5, %zmm6 + +/* check argument value ranges */ + vpaddd iHiDelta+__svml_slog1p_data_internal(%rip), %zmm9, %zmm8 + vpsrad $23, %zmm10, %zmm13 + vmovups sPoly_5+__svml_slog1p_data_internal(%rip), %zmm9 + vpcmpd $5, iLoRange+__svml_slog1p_data_internal(%rip), %zmm8, %k1 + vpslld $23, %zmm13, %zmm14 + vaddps {rn-sae}, %zmm7, %zmm6, %zmm15 + vcvtdq2ps {rn-sae}, %zmm13, %zmm0 + vpsubd %zmm14, %zmm2, %zmm13 + vmovups sPoly_8+__svml_slog1p_data_internal(%rip), %zmm7 + vmovups sPoly_1+__svml_slog1p_data_internal(%rip), %zmm14 + vmulps {rn-sae}, %zmm13, %zmm15, %zmm6 + vpandd 
iOffExpoMask+__svml_slog1p_data_internal(%rip), %zmm10, %zmm11 + vpaddd %zmm12, %zmm11, %zmm5 + vmovups sPoly_4+__svml_slog1p_data_internal(%rip), %zmm10 + vmovups sPoly_3+__svml_slog1p_data_internal(%rip), %zmm11 + vmovups sPoly_2+__svml_slog1p_data_internal(%rip), %zmm12 + +/* polynomial evaluation */ + vsubps {rn-sae}, %zmm2, %zmm5, %zmm2 + vaddps {rn-sae}, %zmm6, %zmm2, %zmm15 + vmovups sPoly_7+__svml_slog1p_data_internal(%rip), %zmm2 + vfmadd231ps {rn-sae}, %zmm15, %zmm7, %zmm2 + vpandnd %zmm8, %zmm8, %zmm4{%k1} + vmovups sPoly_6+__svml_slog1p_data_internal(%rip), %zmm8 + +/* combine and get argument value range mask */ + vptestmd %zmm4, %zmm4, %k0 + vfmadd213ps {rn-sae}, %zmm8, %zmm15, %zmm2 + kmovw %k0, %edx + vfmadd213ps {rn-sae}, %zmm9, %zmm15, %zmm2 + vfmadd213ps {rn-sae}, %zmm10, %zmm15, %zmm2 + vfmadd213ps {rn-sae}, %zmm11, %zmm15, %zmm2 + vfmadd213ps {rn-sae}, %zmm12, %zmm15, %zmm2 + vfmadd213ps {rn-sae}, %zmm14, %zmm15, %zmm2 + vmulps {rn-sae}, %zmm15, %zmm2, %zmm4 + vfmadd213ps {rn-sae}, %zmm15, %zmm15, %zmm4 + +/* final reconstruction */ + vmovups sLn2+__svml_slog1p_data_internal(%rip), %zmm15 + vfmadd213ps {rn-sae}, %zmm4, %zmm15, %zmm0 + vorps %zmm1, %zmm0, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm3, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 
0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): 
+ movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call log1pf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_log1pf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_slog1p_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 SgnMask[16][1]; + __declspec(align(64)) VUINT32 sOne[16][1]; + __declspec(align(64)) VUINT32 sPoly[8][16][1]; + __declspec(align(64)) VUINT32 iHiDelta[16][1]; + __declspec(align(64)) VUINT32 iLoRange[16][1]; + __declspec(align(64)) VUINT32 iBrkValue[16][1]; + __declspec(align(64)) VUINT32 iOffExpoMask[16][1]; + __declspec(align(64)) VUINT32 sLn2[16][1]; +} __svml_slog1p_data_internal; +#endif +__svml_slog1p_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 64 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 
0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iHiDelta = SP 80000000-7f000000 ==*/ + .align 64 + .long 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000 + /*== iLoRange = SP 00800000+iHiDelta ==*/ + .align 64 + .long 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000 + /*== iBrkValue = SP 2/3 ==*/ + .align 64 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 
0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 64 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sLn2 = SP ln(2) ==*/ + .align 64 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 64 + .type __svml_slog1p_data_internal,@object + .size __svml_slog1p_data_internal,.-__svml_slog1p_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core-sse2.S new file mode 100644 index 0000000000..913c8290c8 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized log1pf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define _ZGVbN4v_log1pf _ZGVbN4v_log1pf_sse2 +#include "../svml_s_log1pf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core.c new file mode 100644 index 0000000000..b6aff48023 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log1pf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVbN4v_log1pf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_log1pf, __GI__ZGVbN4v_log1pf, + __redirect__ZGVbN4v_log1pf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core_sse4.S new file mode 100644 index 0000000000..ef1bae58c0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf4_core_sse4.S @@ -0,0 +1,252 @@ +/* Function log1pf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * 1+x = 2^k*(xh + xl) is computed in high-low parts; xh in [1,2) + * Get short reciprocal approximation Rcp ~ 1/xh + * R = (Rcp*xh - 1.0) + Rcp*xl + * log1p(x) = k*log(2.0) - log(Rcp) + poly(R) + * log(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog1p_data_internal + */ +#define SgnMask 0 +#define sOne 16 +#define sPoly 32 +#define iHiDelta 160 +#define iLoRange 176 +#define iBrkValue 192 +#define iOffExpoMask 208 +#define sLn2 224 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_log1pf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movups sOne+__svml_slog1p_data_internal(%rip), %xmm7 + +/* compute 1+x as high, low parts */ + movaps %xmm7, %xmm1 + movaps %xmm7, %xmm5 + maxps %xmm0, %xmm1 + minps %xmm0, %xmm5 + movaps %xmm1, %xmm4 + +/* check argument value ranges */ + movdqu iHiDelta+__svml_slog1p_data_internal(%rip), %xmm2 + addps %xmm5, %xmm4 + +/* reduction: compute r,n */ + movdqu iBrkValue+__svml_slog1p_data_internal(%rip), %xmm3 + paddd %xmm4, %xmm2 + movdqu iOffExpoMask+__svml_slog1p_data_internal(%rip), %xmm8 + subps %xmm4, %xmm1 + psubd %xmm3, %xmm4 + addps %xmm1, %xmm5 + pand %xmm4, %xmm8 + psrad $23, %xmm4 + cvtdq2ps %xmm4, %xmm10 + pslld $23, %xmm4 + movaps %xmm7, %xmm1 + paddd %xmm3, %xmm8 + psubd 
%xmm4, %xmm1 + mulps %xmm5, %xmm1 + +/* polynomial evaluation */ + subps %xmm7, %xmm8 + +/* final reconstruction */ + mulps sLn2+__svml_slog1p_data_internal(%rip), %xmm10 + addps %xmm8, %xmm1 + movups sPoly+112+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + movdqu iLoRange+__svml_slog1p_data_internal(%rip), %xmm6 + pcmpgtd %xmm2, %xmm6 + addps sPoly+96+__svml_slog1p_data_internal(%rip), %xmm9 + +/* combine and get argument value range mask */ + movmskps %xmm6, %edx + movups SgnMask+__svml_slog1p_data_internal(%rip), %xmm11 + mulps %xmm1, %xmm9 + andnps %xmm0, %xmm11 + addps sPoly+80+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + addps sPoly+64+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + addps sPoly+48+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + addps sPoly+32+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + addps sPoly+16+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + addps sPoly+__svml_slog1p_data_internal(%rip), %xmm9 + mulps %xmm1, %xmm9 + mulps %xmm1, %xmm9 + addps %xmm9, %xmm1 + addps %xmm10, %xmm1 + orps %xmm11, %xmm1 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm1, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm1, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d 
+ +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm1 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm1 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call log1pf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_log1pf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_slog1p_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 SgnMask[4][1]; + __declspec(align(16)) VUINT32 sOne[4][1]; + __declspec(align(16)) VUINT32 sPoly[8][4][1]; + __declspec(align(16)) VUINT32 iHiDelta[4][1]; + __declspec(align(16)) VUINT32 iLoRange[4][1]; + __declspec(align(16)) VUINT32 iBrkValue[4][1]; + __declspec(align(16)) VUINT32 iOffExpoMask[4][1]; + __declspec(align(16)) VUINT32 sLn2[4][1]; +} __svml_slog1p_data_internal; +#endif +__svml_slog1p_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 16 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 16 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /*
2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iHiDelta = SP 80000000-7f000000 ==*/ + .align 16 + .long 0x01000000, 0x01000000, 0x01000000, 0x01000000 + /*== iLoRange = SP 00800000+iHiDelta ==*/ + .align 16 + .long 0x01800000, 0x01800000, 0x01800000, 0x01800000 + /*== iBrkValue = SP 2/3 ==*/ + .align 16 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sLn2 = SP ln(2) ==*/ + .align 16 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 16 + .type __svml_slog1p_data_internal,@object + .size __svml_slog1p_data_internal,.-__svml_slog1p_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core-sse.S new file mode 100644 index 0000000000..c0b97d89e6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized log1pf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_log1pf _ZGVdN8v_log1pf_sse_wrapper +#include "../svml_s_log1pf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core.c new file mode 100644 index 0000000000..a2bbe37129 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized log1pf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN8v_log1pf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_log1pf, __GI__ZGVdN8v_log1pf, + __redirect__ZGVdN8v_log1pf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core_avx2.S new file mode 100644 index 0000000000..957dc23e3f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_log1pf8_core_avx2.S @@ -0,0 +1,254 @@ +/* Function log1pf vectorized with AVX2. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * 1+x = 2^k*(xh + xl) is computed in high-low parts; xh in [1,2) + * Get short reciprocal approximation Rcp ~ 1/xh + * R = (Rcp*xh - 1.0) + Rcp*xl + * log1p(x) = k*log(2.0) - log(Rcp) + poly(R) + * log(Rcp) is tabulated + * + * + */ + +/* Offsets for data table __svml_slog1p_data_internal + */ +#define SgnMask 0 +#define sOne 32 +#define sPoly 64 +#define iHiDelta 320 +#define iLoRange 352 +#define iBrkValue 384 +#define iOffExpoMask 416 +#define sLn2 448 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_log1pf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovups sOne+__svml_slog1p_data_internal(%rip), %ymm2 + +/* reduction: compute r,n */ + vmovups iBrkValue+__svml_slog1p_data_internal(%rip), %ymm13 + vmovups SgnMask+__svml_slog1p_data_internal(%rip), %ymm4 + vmovups iLoRange+__svml_slog1p_data_internal(%rip), %ymm8 + vmovaps %ymm0, %ymm3 + +/* compute 1+x as high, low parts */ + vmaxps %ymm3, %ymm2, %ymm5 + vminps %ymm3, %ymm2, %ymm6 + vaddps %ymm6, %ymm5, %ymm10 + vpsubd %ymm13, %ymm10, %ymm11 + +/* check argument value ranges */ + vpaddd 
iHiDelta+__svml_slog1p_data_internal(%rip), %ymm10, %ymm9 + vsubps %ymm10, %ymm5, %ymm7 + vpsrad $23, %ymm11, %ymm14 + vpand iOffExpoMask+__svml_slog1p_data_internal(%rip), %ymm11, %ymm12 + vpslld $23, %ymm14, %ymm15 + vcvtdq2ps %ymm14, %ymm0 + vpsubd %ymm15, %ymm2, %ymm14 + vandnps %ymm3, %ymm4, %ymm1 + vaddps %ymm7, %ymm6, %ymm4 + vpaddd %ymm13, %ymm12, %ymm6 + vmulps %ymm4, %ymm14, %ymm7 + +/* polynomial evaluation */ + vsubps %ymm2, %ymm6, %ymm2 + vpcmpgtd %ymm9, %ymm8, %ymm5 + vmovups sPoly+224+__svml_slog1p_data_internal(%rip), %ymm8 + vaddps %ymm2, %ymm7, %ymm9 + vfmadd213ps sPoly+192+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vfmadd213ps sPoly+160+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vfmadd213ps sPoly+128+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vfmadd213ps sPoly+96+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vfmadd213ps sPoly+64+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vfmadd213ps sPoly+32+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vfmadd213ps sPoly+__svml_slog1p_data_internal(%rip), %ymm9, %ymm8 + vmulps %ymm8, %ymm9, %ymm10 + vfmadd213ps %ymm9, %ymm9, %ymm10 + +/* final reconstruction */ + vfmadd132ps sLn2+__svml_slog1p_data_internal(%rip), %ymm10, %ymm0 + +/* combine and get argument value range mask */ + vmovmskps %ymm5, %edx + vorps %ymm1, %ymm0, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm3, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; 
DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 
0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call log1pf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_log1pf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_slog1p_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 SgnMask[8][1]; + __declspec(align(32)) VUINT32 sOne[8][1]; + __declspec(align(32)) VUINT32 sPoly[8][8][1]; + __declspec(align(32)) VUINT32 iHiDelta[8][1]; + __declspec(align(32)) VUINT32 iLoRange[8][1]; + __declspec(align(32)) VUINT32 iBrkValue[8][1]; + __declspec(align(32)) VUINT32 iOffExpoMask[8][1]; + __declspec(align(32)) VUINT32 sLn2[8][1]; +} __svml_slog1p_data_internal; +#endif +__svml_slog1p_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 32 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01
P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iHiDelta = SP 80000000-7f000000 ==*/ + .align 32 + .long 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000, 0x01000000 + /*== iLoRange = SP 00800000+iHiDelta ==*/ + .align 32 + .long 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000, 0x01800000 + /*== iBrkValue = SP 2/3 ==*/ + .align 32 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 32 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sLn2 = SP ln(2) ==*/ + .align 32 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 32 + .type __svml_slog1p_data_internal,@object + .size __svml_slog1p_data_internal,.-__svml_slog1p_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_log1p2_core.S b/sysdeps/x86_64/fpu/svml_d_log1p2_core.S new file mode 100644 index 0000000000..e3f01717d9 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log1p2_core.S @@ -0,0 +1,29 @@ +/* Function log1p vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_log1p) +WRAPPER_IMPL_SSE2 log1p +END (_ZGVbN2v_log1p) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_log1p) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_log1p4_core.S b/sysdeps/x86_64/fpu/svml_d_log1p4_core.S new file mode 100644 index 0000000000..49beb96183 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log1p4_core.S @@ -0,0 +1,29 @@ +/* Function log1p vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_log1p) +WRAPPER_IMPL_AVX _ZGVbN2v_log1p +END (_ZGVdN4v_log1p) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_log1p) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_log1p4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_log1p4_core_avx.S new file mode 100644 index 0000000000..8b89768b7c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log1p4_core_avx.S @@ -0,0 +1,25 @@ +/* Function log1p vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_log1p) +WRAPPER_IMPL_AVX _ZGVbN2v_log1p +END (_ZGVcN4v_log1p) diff --git a/sysdeps/x86_64/fpu/svml_d_log1p8_core.S b/sysdeps/x86_64/fpu/svml_d_log1p8_core.S new file mode 100644 index 0000000000..54b4d4ede8 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_log1p8_core.S @@ -0,0 +1,25 @@ +/* Function log1p vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_log1p) +WRAPPER_IMPL_AVX512 _ZGVdN4v_log1p +END (_ZGVeN8v_log1p) diff --git a/sysdeps/x86_64/fpu/svml_s_log1pf16_core.S b/sysdeps/x86_64/fpu/svml_s_log1pf16_core.S new file mode 100644 index 0000000000..2c953d00fb --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log1pf16_core.S @@ -0,0 +1,25 @@ +/* Function log1pf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_log1pf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_log1pf +END (_ZGVeN16v_log1pf) diff --git a/sysdeps/x86_64/fpu/svml_s_log1pf4_core.S b/sysdeps/x86_64/fpu/svml_s_log1pf4_core.S new file mode 100644 index 0000000000..6f68762eaa --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log1pf4_core.S @@ -0,0 +1,29 @@ +/* Function log1pf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_log1pf) +WRAPPER_IMPL_SSE2 log1pf +END (_ZGVbN4v_log1pf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_log1pf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_log1pf8_core.S b/sysdeps/x86_64/fpu/svml_s_log1pf8_core.S new file mode 100644 index 0000000000..74f81283b1 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log1pf8_core.S @@ -0,0 +1,29 @@ +/* Function log1pf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_log1pf) +WRAPPER_IMPL_AVX _ZGVbN4v_log1pf +END (_ZGVdN8v_log1pf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_log1pf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_log1pf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_log1pf8_core_avx.S new file mode 100644 index 0000000000..f33be0e904 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_log1pf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function log1pf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_log1pf) +WRAPPER_IMPL_AVX _ZGVbN4v_log1pf +END (_ZGVcN8v_log1pf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx.c new file mode 100644 index 0000000000..18aa6aaeaa --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-log1p.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx2.c new file mode 100644 index 0000000000..18aa6aaeaa --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-log1p.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx512f.c new file mode 100644 index 0000000000..18aa6aaeaa --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log1p-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-log1p.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-log1p.c b/sysdeps/x86_64/fpu/test-double-libmvec-log1p.c new file mode 100644 index 0000000000..40937f987a --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-log1p.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC log1p +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 08c91ff634..38359b05e3 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVbN2v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVbN2vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVbN2v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVbN2v_log2) +VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVbN2v_log1p) #define VEC_INT_TYPE __m128i diff --git 
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index a2fb0de309..17701e7731 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVdN4v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVdN4vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVdN4v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVdN4v_log2) +VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVdN4v_log1p) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index dc65a4ee25..bba62b2446 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVcN4v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVcN4vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVcN4v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVcN4v_log2) +VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVcN4v_log1p) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 253ee8c906..8a04e13a07 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrt), _ZGVeN8v_cbrt) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVeN8vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVeN8v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVeN8v_log2) +VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVeN8v_log1p) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx.c new file mode 100644 index 0000000000..3395decaf4 --- /dev/null +++ 
b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-log1pf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx2.c new file mode 100644 index 0000000000..3395decaf4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-log1pf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx512f.c new file mode 100644 index 0000000000..3395decaf4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-log1pf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-log1pf.c b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf.c new file mode 100644 index 0000000000..1b36069ded --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-log1pf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC log1pf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 1c7db5146c..706f52c618 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVeN16v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVeN16vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVeN16v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVeN16v_log2f) +VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVeN16v_log1pf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 8ec51603b3..ceace4c53a 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVbN4v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), 
_ZGVbN4vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVbN4v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVbN4v_log2f) +VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVbN4v_log1pf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 1cb4553c7a..06a4753409 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVdN8v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVdN8vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVdN8v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVdN8v_log2f) +VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVdN8v_log1pf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 6ecc1792bb..a87e5298e0 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -40,6 +40,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (cbrtf), _ZGVcN8v_cbrtf) VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVcN8vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVcN8v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVcN8v_log2f) +VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVcN8v_log1pf) #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:26 2021
X-Patchwork-Submitter: Sunil Pandey
X-Patchwork-Id: 1573818
To: libc-alpha@sourceware.org
Subject: [PATCH v4 14/18] x86-64: Add vector atanh/atanhf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:26 -0800
Message-Id: <20211228201130.737370-15-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
From: Sunil Pandey
Reply-To: Sunil K Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com

Implement vectorized atanh/atanhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector atanh/atanhf with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_atanh2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_atanh2_core.c | 27 + .../fpu/multiarch/svml_d_atanh2_core_sse4.S | 1516 +++++++++++++++++ .../fpu/multiarch/svml_d_atanh4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_atanh4_core.c | 27 + .../fpu/multiarch/svml_d_atanh4_core_avx2.S | 1476 ++++++++++++++++ .../fpu/multiarch/svml_d_atanh8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_atanh8_core.c | 27 + .../fpu/multiarch/svml_d_atanh8_core_avx512.S | 401 +++++ .../fpu/multiarch/svml_s_atanhf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_atanhf16_core.c | 28 + .../multiarch/svml_s_atanhf16_core_avx512.S | 393 +++++ .../fpu/multiarch/svml_s_atanhf4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_atanhf4_core.c | 28 + .../fpu/multiarch/svml_s_atanhf4_core_sse4.S | 361 ++++ .../fpu/multiarch/svml_s_atanhf8_core-sse.S | 20 + .../fpu/multiarch/svml_s_atanhf8_core.c | 28 + .../fpu/multiarch/svml_s_atanhf8_core_avx2.S | 335 ++++ sysdeps/x86_64/fpu/svml_d_atanh2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_atanh4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_atanh4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_atanh8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_atanhf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_atanhf4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_atanhf8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_atanhf8_core_avx.S | 25 + .../fpu/test-double-libmvec-atanh-avx.c | 1 + .../fpu/test-double-libmvec-atanh-avx2.c | 1 + .../fpu/test-double-libmvec-atanh-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-atanh.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-atanhf-avx.c | 1 + .../fpu/test-float-libmvec-atanhf-avx2.c | 1 + .../fpu/test-float-libmvec-atanhf-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-atanhf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 5054 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atanh2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atanh4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_atanh4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_atanh8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanhf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanhf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanhf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_atanhf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-atanh.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-atanhf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 845246fab9..bb7380a446 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -252,4 +252,15 @@ #define __DECL_SIMD_log1pf32x #define __DECL_SIMD_log1pf64x #define __DECL_SIMD_log1pf128x + +#define __DECL_SIMD_atanh +#define __DECL_SIMD_atanhf +#define __DECL_SIMD_atanhl +#define __DECL_SIMD_atanhf16 +#define __DECL_SIMD_atanhf32 +#define __DECL_SIMD_atanhf64 +#define __DECL_SIMD_atanhf128 +#define __DECL_SIMD_atanhf32x +#define __DECL_SIMD_atanhf64x +#define __DECL_SIMD_atanhf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index aa4bc61aa4..04dd9c5d1b 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -86,7 +86,7 @@ __MATHCALL (acosh,, (_Mdouble_ __x)); /* Hyperbolic arc sine of X. */ __MATHCALL (asinh,, (_Mdouble_ __x)); /* Hyperbolic arc tangent of X. */ -__MATHCALL (atanh,, (_Mdouble_ __x)); +__MATHCALL_VEC (atanh,, (_Mdouble_ __x)); #endif /* Exponential and logarithmic functions. 
*/ diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 68b940606a..2d389912b1 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,6 +49,7 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F +GLIBC_2.35 _ZGVbN2v_atanh F GLIBC_2.35 _ZGVbN2v_cbrt F GLIBC_2.35 _ZGVbN2v_cosh F GLIBC_2.35 _ZGVbN2v_exp10 F @@ -63,6 +64,7 @@ GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F +GLIBC_2.35 _ZGVbN4v_atanhf F GLIBC_2.35 _ZGVbN4v_cbrtf F GLIBC_2.35 _ZGVbN4v_coshf F GLIBC_2.35 _ZGVbN4v_exp10f F @@ -77,6 +79,7 @@ GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F +GLIBC_2.35 _ZGVcN4v_atanh F GLIBC_2.35 _ZGVcN4v_cbrt F GLIBC_2.35 _ZGVcN4v_cosh F GLIBC_2.35 _ZGVcN4v_exp10 F @@ -91,6 +94,7 @@ GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F +GLIBC_2.35 _ZGVcN8v_atanhf F GLIBC_2.35 _ZGVcN8v_cbrtf F GLIBC_2.35 _ZGVcN8v_coshf F GLIBC_2.35 _ZGVcN8v_exp10f F @@ -105,6 +109,7 @@ GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F +GLIBC_2.35 _ZGVdN4v_atanh F GLIBC_2.35 _ZGVdN4v_cbrt F GLIBC_2.35 _ZGVdN4v_cosh F GLIBC_2.35 _ZGVdN4v_exp10 F @@ -119,6 +124,7 @@ GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F +GLIBC_2.35 _ZGVdN8v_atanhf F GLIBC_2.35 _ZGVdN8v_cbrtf F GLIBC_2.35 _ZGVdN8v_coshf F GLIBC_2.35 _ZGVdN8v_exp10f F @@ -133,6 +139,7 @@ GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F +GLIBC_2.35 _ZGVeN16v_atanhf F GLIBC_2.35 _ZGVeN16v_cbrtf F GLIBC_2.35 _ZGVeN16v_coshf F GLIBC_2.35 _ZGVeN16v_exp10f F @@ -147,6 +154,7 @@ 
GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F +GLIBC_2.35 _ZGVeN8v_atanh F GLIBC_2.35 _ZGVeN8v_cbrt F GLIBC_2.35 _ZGVeN8v_cosh F GLIBC_2.35 _ZGVeN8v_exp10 F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 14c9db3bb3..4937b6811f 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -114,6 +114,10 @@ # define __DECL_SIMD_log1p __DECL_SIMD_x86_64 # undef __DECL_SIMD_log1pf # define __DECL_SIMD_log1pf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_atanh +# define __DECL_SIMD_atanh __DECL_SIMD_x86_64 +# undef __DECL_SIMD_atanhf +# define __DECL_SIMD_atanhf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 3dca196432..da39c08ba9 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -56,6 +56,8 @@ !GCC$ builtin (log2f) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (log1p) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (log1pf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (atanh) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (atanhf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -97,3 +99,5 @@ !GCC$ builtin (log2f) attributes simd (notinbranch) if('x32') !GCC$ builtin (log1p) attributes simd (notinbranch) if('x32') !GCC$ builtin (log1pf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (atanh) attributes simd (notinbranch) if('x32') +!GCC$ builtin (atanhf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 378cb06d37..de87544259 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -26,6 +26,7 @@ libmvec-funcs = 
\ asin \ atan \ atan2 \ + atanh \ cbrt \ cos \ cosh \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 155fb115f3..df0ea83711 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,6 +17,7 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; + _ZGVbN2v_atanh; _ZGVcN4v_atanh; _ZGVdN4v_atanh; _ZGVeN8v_atanh; _ZGVbN2v_cbrt; _ZGVcN4v_cbrt; _ZGVdN4v_cbrt; _ZGVeN8v_cbrt; _ZGVbN2v_cosh; _ZGVcN4v_cosh; _ZGVdN4v_cosh; _ZGVeN8v_cosh; _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; @@ -31,6 +32,7 @@ libmvec { _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; + _ZGVbN4v_atanhf; _ZGVcN8v_atanhf; _ZGVdN8v_atanhf; _ZGVeN16v_atanhf; _ZGVbN4v_cbrtf; _ZGVcN8v_cbrtf; _ZGVdN8v_cbrtf; _ZGVeN16v_cbrtf; _ZGVbN4v_coshf; _ZGVcN8v_coshf; _ZGVdN8v_coshf; _ZGVeN16v_coshf; _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index a2b15a795b..09a46190b6 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -248,6 +248,26 @@ float: 3 float128: 4 ldouble: 5 +Function: "atanh_vlen16": +float: 1 + +Function: "atanh_vlen2": +double: 1 + +Function: "atanh_vlen4": +double: 1 +float: 1 + +Function: "atanh_vlen4_avx2": +double: 1 + +Function: "atanh_vlen8": +double: 1 +float: 1 + +Function: "atanh_vlen8_avx2": +float: 1 + Function: "cabs": double: 1 float128: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core-sse2.S new file mode 100644 index 0000000000..b154ab8649 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core-sse2.S @@ -0,0 +1,20 
@@ +/* SSE2 version of vectorized atanh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVbN2v_atanh _ZGVbN2v_atanh_sse2 +#include "../svml_d_atanh2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core.c new file mode 100644 index 0000000000..138190e568 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized atanh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVbN2v_atanh +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_atanh, __GI__ZGVbN2v_atanh, __redirect__ZGVbN2v_atanh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core_sse4.S new file mode 100644 index 0000000000..a72bc473ee --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh2_core_sse4.S @@ -0,0 +1,1516 @@ +/* Function atanh vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute atanh(x) as 0.5 * log((1 + x)/(1 - x)) + * + * Special cases: + * + * atanh(0) = 0 + * atanh(+1) = +INF + * atanh(-1) = -INF + * atanh(x) = NaN if |x| > 1, or if x is a NaN or INF + * + */ + +/* Offsets for data table __svml_datanh_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8208 +#define poly_coeff 12320 +#define ExpMask 12384 +#define Two10 12400 +#define MinLog1p 12416 +#define MaxLog1p 12432 +#define One 12448 +#define SgnMask 12464 +#define XThreshold 12480 +#define XhMask 12496 +#define Threshold 12512 +#define Bias 12528 +#define Bias1 12544 +#define ExpMask0 12560 +#define ExpMask2 12576 +#define L2 12592 +#define dHalf 12608 +#define dSign 12624 +#define dTopMask12 12640 +#define dTopMask41 12656 +#define TinyRange 12672 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_atanh_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + movaps %xmm0, %xmm12 + movups SgnMask+__svml_datanh_data_internal(%rip), %xmm7 + lea -4218864+__svml_datanh_data_internal(%rip), %rsi + +/* Load the constant 1 and a sign mask */ + movups One+__svml_datanh_data_internal(%rip), %xmm11 + +/* Strip off the sign, so treat X as positive until right at the end */ + movaps %xmm7, %xmm14 + andps %xmm12, %xmm14 + movaps %xmm11, %xmm15 + subpd %xmm14, %xmm15 + movups dTopMask41+__svml_datanh_data_internal(%rip), %xmm2 + movaps %xmm11, %xmm5 + movaps %xmm2, %xmm0 + +/* + * Compute V = 2 * X trivially, and UHi + U_lo = 1 - X in two pieces, + * the upper part UHi being <= 41 bits long. Then we have + * atanh(X) = 1/2 * log((1 + X) / (1 - X)) = 1/2 * log1p(V / (UHi + ULo)). + */ + movaps %xmm14, %xmm6 + andps %xmm15, %xmm0 + +/* + * Check whether |X| < 1, in which case we use the main function. + * Otherwise set the rangemask so that the callout will get used. 
+ * Note that this will also use the callout for NaNs since not(NaN < 1). + */ + movaps %xmm14, %xmm13 + +/* + * Now compute R = 1/(UHi+ULo) * (1 - E) and the error term E + * The first FMR is exact (we force R to 12 bits just in case it + * isn't already, to make absolutely sure), and since E is ~ 2^-12, + * the rounding error in the other one is acceptable. + */ + cvtpd2ps %xmm0, %xmm1 + subpd %xmm15, %xmm5 + addpd %xmm14, %xmm6 + subpd %xmm0, %xmm15 + cmpnltpd %xmm11, %xmm13 + subpd %xmm14, %xmm5 + movmskpd %xmm13, %edx + movlhps %xmm1, %xmm1 + movaps %xmm14, %xmm9 + rcpps %xmm1, %xmm4 + addpd %xmm15, %xmm5 + cmpltpd TinyRange+__svml_datanh_data_internal(%rip), %xmm9 + cvtps2pd %xmm4, %xmm14 + andps dTopMask12+__svml_datanh_data_internal(%rip), %xmm14 + movaps %xmm11, %xmm13 + mulpd %xmm14, %xmm0 + mulpd %xmm14, %xmm5 + subpd %xmm0, %xmm13 + +/* + * Split V as well into upper 41 bits and lower part, so that we can get + * a preliminary quotient estimate without rounding error. + */ + andps %xmm6, %xmm2 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * later incorporating L into the reduced argument. 
+ * compute 1+x as high, low parts + */ + movaps %xmm11, %xmm0 + subpd %xmm5, %xmm13 + subpd %xmm2, %xmm6 + +/* Hence get initial quotient estimate QHi + QLo = R * VHi + R * VLo */ + mulpd %xmm14, %xmm2 + mulpd %xmm6, %xmm14 + +/* + * Compute D = E + E^2 + E^3 + E^4 + E^5 + * = E + (E + E^2) (E + E * E^2) + */ + movaps %xmm13, %xmm6 + movaps %xmm13, %xmm3 + mulpd %xmm13, %xmm6 + mulpd %xmm6, %xmm3 + addpd %xmm13, %xmm6 + addpd %xmm13, %xmm3 + mulpd %xmm3, %xmm6 + addpd %xmm6, %xmm13 + +/* + * Compute R * (VHi + VLo) * (1 + E + E^2 + E^3 + E^4 + E^5) + * = R * (VHi + VLo) * (1 + D) + * = QHi + (QHi * D + QLo + QLo * D) + */ + movaps %xmm13, %xmm1 + movaps %xmm11, %xmm5 + mulpd %xmm14, %xmm13 + mulpd %xmm2, %xmm1 + addpd %xmm13, %xmm14 + addpd %xmm14, %xmm1 + +/* + * Now finally accumulate the high and low parts of the + * argument to log1p, H + L, with a final compensated summation. + */ + addpd %xmm1, %xmm2 + maxpd %xmm2, %xmm0 + minpd %xmm2, %xmm5 + andps %xmm7, %xmm2 + movaps %xmm0, %xmm4 + cmpltpd XThreshold+__svml_datanh_data_internal(%rip), %xmm2 + addpd %xmm5, %xmm4 + orps XhMask+__svml_datanh_data_internal(%rip), %xmm2 + movaps %xmm12, %xmm10 + +/* preserve mantissa, set input exponent to 2^(-10) */ + movups ExpMask+__svml_datanh_data_internal(%rip), %xmm7 + andps %xmm2, %xmm4 + andps %xmm4, %xmm7 + +/* exponent bits */ + movaps %xmm4, %xmm6 + orps Two10+__svml_datanh_data_internal(%rip), %xmm7 + psrlq $20, %xmm6 + +/* reciprocal approximation good to at least 11 bits */ + cvtpd2ps %xmm7, %xmm1 + subpd %xmm4, %xmm0 + mulpd %xmm12, %xmm10 + addpd %xmm0, %xmm5 + addpd %xmm12, %xmm10 + movlhps %xmm1, %xmm1 + rcpps %xmm1, %xmm15 + cvtps2pd %xmm15, %xmm3 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + movups .FLT_21(%rip), %xmm1 + addpd %xmm1, %xmm3 + subpd %xmm1, %xmm3 + +/* exponent of X needed to scale Xl */ + movdqu ExpMask0+__svml_datanh_data_internal(%rip), %xmm0 + +/* + * prepare table index + * table lookup + */ + movaps %xmm3, 
%xmm13 + +/* 2^ (-10-exp(X) ) */ + movdqu ExpMask2+__svml_datanh_data_internal(%rip), %xmm2 + pand %xmm4, %xmm0 + psubq %xmm0, %xmm2 + +/* scale DblRcp */ + mulpd %xmm3, %xmm2 + +/* argument reduction */ + mulpd %xmm2, %xmm4 + mulpd %xmm2, %xmm5 + subpd %xmm11, %xmm4 + addpd %xmm5, %xmm4 + +/* polynomial */ + movups poly_coeff+__svml_datanh_data_internal(%rip), %xmm11 + psrlq $40, %xmm13 + mulpd %xmm4, %xmm11 + movd %xmm13, %eax + pshufd $221, %xmm6, %xmm7 + +/* exponent*log(2.0) */ + movups Threshold+__svml_datanh_data_internal(%rip), %xmm6 + cmpltpd %xmm3, %xmm6 + addpd poly_coeff+16+__svml_datanh_data_internal(%rip), %xmm11 + +/* biased exponent in DP format */ + cvtdq2pd %xmm7, %xmm1 + movaps %xmm4, %xmm3 + mulpd %xmm4, %xmm3 + movups poly_coeff+32+__svml_datanh_data_internal(%rip), %xmm2 + mulpd %xmm4, %xmm2 + mulpd %xmm3, %xmm11 + addpd poly_coeff+48+__svml_datanh_data_internal(%rip), %xmm2 + addpd %xmm11, %xmm2 + +/* reconstruction */ + mulpd %xmm2, %xmm3 + andps Bias+__svml_datanh_data_internal(%rip), %xmm6 + orps Bias1+__svml_datanh_data_internal(%rip), %xmm6 + pshufd $2, %xmm13, %xmm14 + subpd %xmm6, %xmm1 + addpd %xmm3, %xmm4 + movd %xmm14, %ecx + mulpd L2+__svml_datanh_data_internal(%rip), %xmm1 + movslq %eax, %rax + movslq %ecx, %rcx + +/* Record the sign for eventual reincorporation. 
*/ + movups dSign+__svml_datanh_data_internal(%rip), %xmm8 + andps %xmm12, %xmm8 + movsd (%rsi,%rax), %xmm0 + +/* Or the sign bit in with the tiny result to handle atanh(-0) correctly */ + orps %xmm8, %xmm10 + movhpd (%rsi,%rcx), %xmm0 + andps %xmm9, %xmm10 + addpd %xmm4, %xmm0 + addpd %xmm0, %xmm1 + +/* Finally, halve the result and reincorporate the sign */ + movups dHalf+__svml_datanh_data_internal(%rip), %xmm4 + movaps %xmm9, %xmm0 + pxor %xmm8, %xmm4 + mulpd %xmm1, %xmm4 + andnps %xmm4, %xmm0 + orps %xmm10, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm12 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm12, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc 
L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call atanh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_atanh_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_datanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(16)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(16)) VUINT32 poly_coeff[4][2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 Two10[2][2]; + __declspec(align(16)) VUINT32 MinLog1p[2][2]; + __declspec(align(16)) VUINT32
MaxLog1p[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 SgnMask[2][2]; + __declspec(align(16)) VUINT32 XThreshold[2][2]; + __declspec(align(16)) VUINT32 XhMask[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 Bias[2][2]; + __declspec(align(16)) VUINT32 Bias1[2][2]; + __declspec(align(16)) VUINT32 ExpMask0[2][2]; + __declspec(align(16)) VUINT32 ExpMask2[2][2]; + __declspec(align(16)) VUINT32 L2[2][2]; + __declspec(align(16)) VUINT32 dHalf[2][2]; + __declspec(align(16)) VUINT32 dSign[2][2]; + __declspec(align(16)) VUINT32 dTopMask12[2][2]; + __declspec(align(16)) VUINT32 dTopMask41[2][2]; + __declspec(align(16)) VUINT32 TinyRange[2][2]; +} __svml_datanh_data_internal; +#endif +__svml_datanh_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 
0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 
0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + .quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + 
.quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 
0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + .quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + 
.quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 
0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + .quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + 
.quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 
0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + .quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + 
.quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 
0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + .quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + 
.quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 
0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 16 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + 
.quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + 
.quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + 
.quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + 
.quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + 
.quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + 
.quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + 
.quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 0x3f8010157588de71 + .quad 0x3f7c189cbb0e27fb + .quad 0x3f78121214586b54 + .quad 0x3f740c8a747878e2 + .quad 0x3f70080559588b35 + .quad 0x3f680904828985c0 + .quad 0x3f60040155d5889e + .quad 0x3f50020055655889 + .quad 0x0000000000000000 + /*== poly_coeff[4] ==*/ + .align 16 + .quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */ + .quad 0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */ + .quad 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */ + .quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */ + /*== ExpMask ==*/ + .align 16 + .quad 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 16 + .quad 0x3f50000000000000, 0x3f50000000000000 + /*== MinLog1p = -1+2^(-53) ==*/ + .align 16 + .quad 0xbfefffffffffffff, 0xbfefffffffffffff + /*== MaxLog1p ==*/ + .align 16 + .quad 0x7f3ffffffffff000, 0x7f3ffffffffff000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== SgnMask ==*/ + .align 16 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff + /*== XThreshold ==*/ + .align 16 + .quad 0x3e00000000000000, 0x3e00000000000000 + /*== XhMask ==*/ + .align 16 + .quad 0xfffffffffffffc00, 0xfffffffffffffc00 + /*== Threshold ==*/ + .align 16 + .quad 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 16 + .quad 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 16 + .quad 0x408ff00000000000, 
0x408ff00000000000 + /*== ExpMask ==*/ + .align 16 + .quad 0x7ff0000000000000, 0x7ff0000000000000 + /*== ExpMask2 ==*/ + .align 16 + .quad 0x7f40000000000000, 0x7f40000000000000 + /*== L2L ==*/ + .align 16 + .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF + /*== dHalf ==*/ + .align 16 + .quad 0x3FE0000000000000, 0x3FE0000000000000 + /*== dSign ==*/ + .align 16 + .quad 0x8000000000000000, 0x8000000000000000 + /*== dTopMask12 ==*/ + .align 16 + .quad 0xFFFFFE0000000000, 0xFFFFFE0000000000 + /*== dTopMask41 ==*/ + .align 16 + .quad 0xFFFFFFFFFFFFF000, 0xFFFFFFFFFFFFF000 + /*== dTinyRange ==*/ + .align 16 + .quad 0x0350000000000000, 0x0350000000000000 + .align 16 + .type __svml_datanh_data_internal,@object + .size __svml_datanh_data_internal,.-__svml_datanh_data_internal + .align 16 + +.FLT_21: + .long 0x00000000,0x43380000,0x00000000,0x43380000 + .type .FLT_21,@object + .size .FLT_21,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core-sse.S new file mode 100644 index 0000000000..a39cbb7595 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized atanh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +#define _ZGVdN4v_atanh _ZGVdN4v_atanh_sse_wrapper +#include "../svml_d_atanh4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core.c new file mode 100644 index 0000000000..e8ef343ae7 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized atanh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVdN4v_atanh +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_atanh, __GI__ZGVdN4v_atanh, __redirect__ZGVdN4v_atanh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core_avx2.S new file mode 100644 index 0000000000..833a70785f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh4_core_avx2.S @@ -0,0 +1,1476 @@ +/* Function atanh vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute atanh(x) as 0.5 * log((1 + x)/(1 - x)) + * + * Special cases: + * + * atanh(0) = 0 + * atanh(+1) = +INF + * atanh(-1) = -INF + * atanh(x) = NaN if |x| > 1, or if x is a NaN or INF + * + */ + +/* Offsets for data table __svml_datanh_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8224 +#define poly_coeff 12352 +#define ExpMask 12480 +#define Two10 12512 +#define MinLog1p 12544 +#define MaxLog1p 12576 +#define One 12608 +#define SgnMask 12640 +#define XThreshold 12672 +#define XhMask 12704 +#define Threshold 12736 +#define Bias 12768 +#define Bias1 12800 +#define ExpMask0 12832 +#define ExpMask2 12864 +#define L2 12896 +#define dHalf 12928 +#define dSign 12960 +#define dTopMask12 12992 +#define dTopMask41 13024 +#define TinyRange 13056 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_atanh_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea -4218848+__svml_datanh_data_internal(%rip), %r8 + vmovupd SgnMask+__svml_datanh_data_internal(%rip), %ymm7 + +/* Load the constant 1 and a sign mask */ + vmovupd One+__svml_datanh_data_internal(%rip), %ymm11 + vmovapd %ymm0, %ymm12 + +/* Strip off the sign, so treat X as positive until 
right at the end */ + vandpd %ymm7, %ymm12, %ymm0 + vsubpd %ymm0, %ymm11, %ymm6 + +/* + * Check whether |X| < 1, in which case we use the main function. + * Otherwise set the rangemask so that the callout will get used. + * Note that this will also use the callout for NaNs since not(NaN < 1). + */ + vcmpnlt_uqpd %ymm11, %ymm0, %ymm13 + vcmplt_oqpd TinyRange+__svml_datanh_data_internal(%rip), %ymm0, %ymm10 + vsubpd %ymm6, %ymm11, %ymm15 + +/* + * Compute V = 2 * X trivially, and UHi + U_lo = 1 - X in two pieces, + * the upper part UHi being <= 41 bits long. Then we have + * atanh(X) = 1/2 * log((1 + X) / (1 - X)) = 1/2 * log1p(V / (UHi + ULo)). + */ + vaddpd %ymm0, %ymm0, %ymm3 + vcvtpd2ps %ymm6, %xmm5 + vsubpd %ymm0, %ymm15, %ymm1 + vrcpps %xmm5, %xmm4 + vmovapd %ymm12, %ymm14 + vfmadd213pd %ymm12, %ymm12, %ymm14 + vcvtps2pd %xmm4, %ymm2 + +/* Record the sign for eventual reincorporation. */ + vandpd dSign+__svml_datanh_data_internal(%rip), %ymm12, %ymm9 + +/* Or the sign bit in with the tiny result to handle atanh(-0) correctly */ + vorpd %ymm9, %ymm14, %ymm8 + vandpd dTopMask12+__svml_datanh_data_internal(%rip), %ymm2, %ymm14 + +/* No need to split dU when FMA is available */ + vfnmadd213pd %ymm11, %ymm14, %ymm6 + vfnmadd231pd %ymm14, %ymm1, %ymm6 + +/* + * Compute D = E + E^2 + E^3 + E^4 + E^5 + * = E + (E + E^2) (E + E * E^2) + * Only saves when FMA is available + */ + vmovapd %ymm11, %ymm0 + vmovapd %ymm6, %ymm5 + vfmadd231pd %ymm6, %ymm6, %ymm0 + vfmadd213pd %ymm6, %ymm6, %ymm5 + vfmadd213pd %ymm11, %ymm0, %ymm5 + vmovmskpd %ymm13, %eax + +/* + * Split V as well into upper 41 bits and lower part, so that we can get + * a preliminary quotient estimate without rounding error. 
+ */ + vandpd dTopMask41+__svml_datanh_data_internal(%rip), %ymm3, %ymm13 + vsubpd %ymm13, %ymm3, %ymm15 + +/* Hence get initial quotient estimate QHi + QLo = R * VHi + R * VLo */ + vmulpd %ymm13, %ymm14, %ymm2 + vmulpd %ymm5, %ymm6, %ymm0 + vmulpd %ymm15, %ymm14, %ymm4 + +/* 2^ (-10-exp(X) ) */ + vmovupd ExpMask2+__svml_datanh_data_internal(%rip), %ymm15 + +/* + * Compute R * (VHi + VLo) * (1 + E + E^2 + E^3 + E^4 + E^5) + * = R * (VHi + VLo) * (1 + D) + * = QHi + (QHi * D + QLo + QLo * D) + */ + vmulpd %ymm0, %ymm2, %ymm6 + vfmadd213pd %ymm4, %ymm4, %ymm0 + vaddpd %ymm0, %ymm6, %ymm5 + +/* + * Now finally accumulate the high and low parts of the + * argument to log1p, H + L, with a final compensated summation. + */ + vaddpd %ymm5, %ymm2, %ymm4 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * later incorporating L into the reduced argument. + * compute 1+x as high, low parts + */ + vmaxpd %ymm4, %ymm11, %ymm1 + vminpd %ymm4, %ymm11, %ymm3 + vandpd %ymm7, %ymm4, %ymm7 + vcmplt_oqpd XThreshold+__svml_datanh_data_internal(%rip), %ymm7, %ymm0 + vaddpd %ymm3, %ymm1, %ymm5 + vorpd XhMask+__svml_datanh_data_internal(%rip), %ymm0, %ymm4 + vandpd %ymm4, %ymm5, %ymm5 + +/* preserve mantissa, set input exponent to 2^(-10) */ + vandpd ExpMask+__svml_datanh_data_internal(%rip), %ymm5, %ymm6 + vorpd Two10+__svml_datanh_data_internal(%rip), %ymm6, %ymm7 + +/* reciprocal approximation good to at least 11 bits */ + vcvtpd2ps %ymm7, %xmm13 + vsubpd %ymm5, %ymm1, %ymm2 + vrcpps %xmm13, %xmm14 + vaddpd %ymm2, %ymm3, %ymm4 + vcvtps2pd %xmm14, %ymm3 + +/* exponent bits */ + vpsrlq $20, %ymm5, %ymm2 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + vroundpd $0, %ymm3, %ymm3 + +/* + * prepare table index + * table lookup + */ + vpsrlq $40, %ymm3, %ymm13 + +/* exponent of X needed to scale Xl */ + vandps ExpMask0+__svml_datanh_data_internal(%rip), %ymm5, %ymm0 + vpsubq %ymm0, %ymm15, %ymm6 + +/* Finally, halve the result and 
reincorporate the sign */ + vxorpd dHalf+__svml_datanh_data_internal(%rip), %ymm9, %ymm9 + vmovd %xmm13, %edx + vextractf128 $1, %ymm13, %xmm0 + movslq %edx, %rdx + vpextrd $2, %xmm13, %ecx + movslq %ecx, %rcx + vmovd %xmm0, %esi + vmovsd (%r8,%rdx), %xmm14 + vmovhpd (%r8,%rcx), %xmm14, %xmm15 + +/* exponent*log(2.0) */ + vmovupd Threshold+__svml_datanh_data_internal(%rip), %ymm14 + movslq %esi, %rsi + vpextrd $2, %xmm0, %edi + movslq %edi, %rdi + vextractf128 $1, %ymm2, %xmm1 + vshufps $221, %xmm1, %xmm2, %xmm7 + +/* scale DblRcp */ + vmulpd %ymm6, %ymm3, %ymm2 + vmovsd (%r8,%rsi), %xmm6 + +/* biased exponent in DP format */ + vcvtdq2pd %xmm7, %ymm1 + vmovhpd (%r8,%rdi), %xmm6, %xmm7 + vcmplt_oqpd %ymm3, %ymm14, %ymm3 + +/* argument reduction */ + vfmsub213pd %ymm11, %ymm2, %ymm5 + vmulpd %ymm2, %ymm4, %ymm11 + vmovupd poly_coeff+64+__svml_datanh_data_internal(%rip), %ymm2 + vaddpd %ymm11, %ymm5, %ymm5 + vandpd Bias+__svml_datanh_data_internal(%rip), %ymm3, %ymm3 + vorpd Bias1+__svml_datanh_data_internal(%rip), %ymm3, %ymm6 + vsubpd %ymm6, %ymm1, %ymm1 + vfmadd213pd poly_coeff+96+__svml_datanh_data_internal(%rip), %ymm5, %ymm2 + vmulpd %ymm5, %ymm5, %ymm4 + vmulpd L2+__svml_datanh_data_internal(%rip), %ymm1, %ymm3 + +/* polynomial */ + vmovupd poly_coeff+__svml_datanh_data_internal(%rip), %ymm1 + vfmadd213pd poly_coeff+32+__svml_datanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213pd %ymm2, %ymm4, %ymm1 + +/* reconstruction */ + vfmadd213pd %ymm5, %ymm4, %ymm1 + vinsertf128 $1, %xmm7, %ymm15, %ymm0 + vaddpd %ymm1, %ymm0, %ymm0 + vaddpd %ymm0, %ymm3, %ymm6 + vmulpd %ymm6, %ymm9, %ymm0 + vblendvpd %ymm10, %ymm8, %ymm0, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm12 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + 
* special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm12, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 
0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call atanh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_atanh_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_datanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(32)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(32)) VUINT32 poly_coeff[4][4][2]; + __declspec(align(32)) VUINT32 ExpMask[4][2]; + __declspec(align(32)) VUINT32 Two10[4][2]; + __declspec(align(32)) VUINT32 MinLog1p[4][2]; + __declspec(align(32)) VUINT32 MaxLog1p[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 SgnMask[4][2]; + __declspec(align(32)) VUINT32 XThreshold[4][2]; + __declspec(align(32)) VUINT32 XhMask[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 Bias[4][2]; + __declspec(align(32)) VUINT32 Bias1[4][2]; + __declspec(align(32)) VUINT32 ExpMask0[4][2]; + __declspec(align(32)) VUINT32 ExpMask2[4][2]; + __declspec(align(32)) VUINT32 L2[4][2]; + __declspec(align(32)) VUINT32 dHalf[4][2]; + __declspec(align(32)) VUINT32 dSign[4][2]; + __declspec(align(32)) VUINT32 dTopMask12[4][2]; + __declspec(align(32)) VUINT32 dTopMask41[4][2]; + __declspec(align(32)) VUINT32 TinyRange[4][2]; +} __svml_datanh_data_internal; +#endif +__svml_datanh_data_internal: + /* Log_HA_table */ + .quad 
0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 
0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + 
.quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + .quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 
0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + 
.quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + .quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 
0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + 
.quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + .quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 
0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + 
.quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + .quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 
0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + 
.quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + .quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 
0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + 
.quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 32 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 
0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 
0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 
0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 
0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 
0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 
0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 0x3f8010157588de71 + .quad 0x3f7c189cbb0e27fb + .quad 0x3f78121214586b54 + .quad 0x3f740c8a747878e2 + .quad 0x3f70080559588b35 + .quad 0x3f680904828985c0 + .quad 0x3f60040155d5889e + .quad 
0x3f50020055655889 + .quad 0x0000000000000000 + /*== poly_coeff[4] ==*/ + .align 32 + .quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */ + .quad 0xbfd0000148058EE1, 0xbfd0000148058EE1, 0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */ + .quad 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */ + .quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */ + /*== ExpMask ==*/ + .align 32 + .quad 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 32 + .quad 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000 + /*== MinLog1p = -1+2^(-53) ==*/ + .align 32 + .quad 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff + /*== MaxLog1p ==*/ + .align 32 + .quad 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000 + /*== One ==*/ + .align 32 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== SgnMask ==*/ + .align 32 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== XThreshold ==*/ + .align 32 + .quad 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000 + /*== XhMask ==*/ + .align 32 + .quad 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00 + /*== Threshold ==*/ + .align 32 + .quad 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 32 + .quad 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 32 + .quad 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000 + /*== ExpMask ==*/ + .align 32 + .quad 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000 + /*== ExpMask2 ==*/ + .align 32 + .quad 0x7f40000000000000, 
0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000
+        /*== L2L ==*/
+        .align 32
+        .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF
+        /*== dHalf ==*/
+        .align 32
+        .quad 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000
+        /*== dSign ==*/
+        .align 32
+        .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000
+        /*== dTopMask12 ==*/
+        .align 32
+        .quad 0xFFFFFE0000000000, 0xFFFFFE0000000000, 0xFFFFFE0000000000, 0xFFFFFE0000000000
+        /*== dTopMask41 ==*/
+        .align 32
+        .quad 0xFFFFFFFFFFFFF000, 0xFFFFFFFFFFFFF000, 0xFFFFFFFFFFFFF000, 0xFFFFFFFFFFFFF000
+        /*== dTinyRange ==*/
+        .align 32
+        .quad 0x0350000000000000, 0x0350000000000000, 0x0350000000000000, 0x0350000000000000
+        .align 32
+        .type __svml_datanh_data_internal,@object
+        .size __svml_datanh_data_internal,.-__svml_datanh_data_internal
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core-avx2.S
new file mode 100644
index 0000000000..675ebd2fd6
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core-avx2.S
@@ -0,0 +1,20 @@
+/* AVX2 version of vectorized atanh, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define _ZGVeN8v_atanh _ZGVeN8v_atanh_avx2_wrapper
+#include "../svml_d_atanh8_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core.c
new file mode 100644
index 0000000000..4da8e20fad
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized atanh, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVeN8v_atanh
+#include "ifunc-mathvec-avx512-skx.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVeN8v_atanh, __GI__ZGVeN8v_atanh, __redirect__ZGVeN8v_atanh)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core_avx512.S
new file mode 100644
index 0000000000..ef600c073a
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_atanh8_core_avx512.S
@@ -0,0 +1,401 @@
+/* Function atanh vectorized with AVX-512.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   https://www.gnu.org/licenses/.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *   Compute atanh(x) as 0.5 * log((1 + x)/(1 - x)),
+ *   using a small lookup table that maps to AVX-512 permute instructions.
+ *
+ *   Special cases:
+ *
+ *   atanh(0)  = 0
+ *   atanh(+1) = +INF
+ *   atanh(-1) = -INF
+ *   atanh(x)  = NaN if |x| > 1, or if x is a NaN or INF
+ *
+ */
+
+/* Offsets for data table __svml_datanh_data_internal_avx512
+ */
+#define Log_tbl_H 0
+#define Log_tbl_L 128
+#define One 256
+#define AbsMask 320
+#define AddB5 384
+#define RcpBitMask 448
+#define poly_coeff8 512
+#define poly_coeff7 576
+#define poly_coeff6 640
+#define poly_coeff5 704
+#define poly_coeff4 768
+#define poly_coeff3 832
+#define poly_coeff2 896
+#define poly_coeff1 960
+#define poly_coeff0 1024
+#define Half 1088
+#define L2H 1152
+#define L2L 1216
+
+#include <sysdep.h>
+
+        .text
+        .section .text.evex512,"ax",@progbits
+ENTRY(_ZGVeN8v_atanh_skx)
+        pushq   %rbp
+        cfi_def_cfa_offset(16)
+        movq    %rsp, %rbp
+        cfi_def_cfa(6, 16)
+        cfi_offset(6, -16)
+        andq    $-64, %rsp
+        subq    $192, %rsp
+        vmovups One+__svml_datanh_data_internal_avx512(%rip), %zmm15
+
+/* round reciprocals to 1+4b mantissas */
+        vmovups AddB5+__svml_datanh_data_internal_avx512(%rip), %zmm6
+        vmovups RcpBitMask+__svml_datanh_data_internal_avx512(%rip), %zmm9
+        vmovaps %zmm0, %zmm2
+        vandpd  AbsMask+__svml_datanh_data_internal_avx512(%rip), %zmm2, %zmm13
+
+/* 1+y */
+        vaddpd  {rn-sae}, %zmm15, %zmm13, %zmm0
+
+/* 1-y */
+        vsubpd  {rn-sae}, %zmm13, %zmm15, %zmm4
+        vxorpd  %zmm13, %zmm2, %zmm1
+
+/* Yp_high */
+        vsubpd  {rn-sae}, %zmm15, %zmm0, %zmm7
+
+/* -Ym_high */
+        vsubpd  {rn-sae}, %zmm15, %zmm4, %zmm12
+
+/* RcpP ~ 1/Yp */
+        vrcp14pd %zmm0, %zmm3
+
+/* RcpM ~ 1/Ym */
+        vrcp14pd %zmm4, %zmm5
+
+/* input outside (-1, 1) ? */
+        vcmppd  $21, {sae}, %zmm15, %zmm13, %k0
+        vpaddq  %zmm6, %zmm3, %zmm11
+        vpaddq  %zmm6, %zmm5, %zmm10
+
+/* Yp_low */
+        vsubpd  {rn-sae}, %zmm7, %zmm13, %zmm8
+        vandpd  %zmm9, %zmm11, %zmm14
+        vandpd  %zmm9, %zmm10, %zmm3
+
+/* Ym_low */
+        vaddpd  {rn-sae}, %zmm12, %zmm13, %zmm12
+
+/* Reduced argument: Rp = (RcpP*Yp - 1)+RcpP*Yp_low */
+        vfmsub213pd {rn-sae}, %zmm15, %zmm14, %zmm0
+
+/* Reduced argument: Rm = (RcpM*Ym - 1)+RcpM*Ym_low */
+        vfmsub231pd {rn-sae}, %zmm3, %zmm4, %zmm15
+
+/* exponents */
+        vgetexppd {sae}, %zmm14, %zmm5
+        vgetexppd {sae}, %zmm3, %zmm4
+
+/* Table lookups */
+        vmovups __svml_datanh_data_internal_avx512(%rip), %zmm9
+        vmovups Log_tbl_H+64+__svml_datanh_data_internal_avx512(%rip), %zmm13
+        vmovups Log_tbl_L+__svml_datanh_data_internal_avx512(%rip), %zmm7
+        vfmadd231pd {rn-sae}, %zmm14, %zmm8, %zmm0
+        vfnmadd231pd {rn-sae}, %zmm3, %zmm12, %zmm15
+
+/* Prepare table index */
+        vpsrlq  $48, %zmm14, %zmm11
+        vpsrlq  $48, %zmm3, %zmm8
+        vmovups Log_tbl_L+64+__svml_datanh_data_internal_avx512(%rip), %zmm14
+
+/* polynomials */
+        vmovups poly_coeff8+__svml_datanh_data_internal_avx512(%rip), %zmm3
+
+/* Km-Kp */
+        vsubpd  {rn-sae}, %zmm5, %zmm4, %zmm5
+        vmovups poly_coeff7+__svml_datanh_data_internal_avx512(%rip), %zmm4
+        kmovw   %k0, %edx
+        vmovaps %zmm11, %zmm10
+        vmovaps %zmm4, %zmm6
+        vpermi2pd %zmm13, %zmm9, %zmm10
+        vpermi2pd %zmm14, %zmm7, %zmm11
+        vpermt2pd %zmm13, %zmm8, %zmm9
+        vpermt2pd %zmm14, %zmm8, %zmm7
+        vmovups poly_coeff6+__svml_datanh_data_internal_avx512(%rip), %zmm8
+        vfmadd231pd {rn-sae}, %zmm0, %zmm3, %zmm6
+        vfmadd231pd {rn-sae}, %zmm15, %zmm3, %zmm4
+        vmovups poly_coeff3+__svml_datanh_data_internal_avx512(%rip), %zmm13
+        vmovups poly_coeff2+__svml_datanh_data_internal_avx512(%rip), %zmm14
+        vfmadd213pd {rn-sae}, %zmm8, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm8, %zmm15, %zmm4
+        vmovups poly_coeff0+__svml_datanh_data_internal_avx512(%rip), %zmm8
+        vsubpd  {rn-sae}, %zmm11, %zmm7, %zmm12
+
+/* table values */
+        vsubpd  {rn-sae}, %zmm10, %zmm9, %zmm3
+        vmovups poly_coeff5+__svml_datanh_data_internal_avx512(%rip), %zmm7
+        vmovups poly_coeff4+__svml_datanh_data_internal_avx512(%rip), %zmm9
+
+/* K*L2H + Th */
+        vmovups L2H+__svml_datanh_data_internal_avx512(%rip), %zmm10
+
+/* K*L2L + Tl */
+        vmovups L2L+__svml_datanh_data_internal_avx512(%rip), %zmm11
+        vfmadd213pd {rn-sae}, %zmm7, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm7, %zmm15, %zmm4
+        vmovups poly_coeff1+__svml_datanh_data_internal_avx512(%rip), %zmm7
+        vfmadd231pd {rn-sae}, %zmm5, %zmm10, %zmm3
+        vfmadd213pd {rn-sae}, %zmm12, %zmm11, %zmm5
+        vfmadd213pd {rn-sae}, %zmm9, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm9, %zmm15, %zmm4
+        vfmadd213pd {rn-sae}, %zmm13, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm13, %zmm15, %zmm4
+        vfmadd213pd {rn-sae}, %zmm14, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm14, %zmm15, %zmm4
+        vfmadd213pd {rn-sae}, %zmm7, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm7, %zmm15, %zmm4
+        vfmadd213pd {rn-sae}, %zmm8, %zmm0, %zmm6
+        vfmadd213pd {rn-sae}, %zmm8, %zmm15, %zmm4
+
+/* (K*L2L + Tl) + Rp*PolyP */
+        vfmadd213pd {rn-sae}, %zmm5, %zmm0, %zmm6
+        vorpd   Half+__svml_datanh_data_internal_avx512(%rip), %zmm1, %zmm0
+
+/* (K*L2L + Tl) + Rp*PolyP -Rm*PolyM */
+        vfnmadd213pd {rn-sae}, %zmm6, %zmm15, %zmm4
+        vaddpd  {rn-sae}, %zmm4, %zmm3, %zmm1
+        vmulpd  {rn-sae}, %zmm0, %zmm1, %zmm0
+        testl   %edx, %edx
+
+/* Go to special inputs processing branch */
+        jne     L(SPECIAL_VALUES_BRANCH)
+        # LOE rbx r12 r13 r14 r15 edx zmm0 zmm2
+
+/* Restore registers
+ * and exit the function
+ */
+
+L(EXIT):
+        movq    %rbp,
%rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm2, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 
(r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call atanh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_atanh_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_datanh_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl_H[16][2]; + __declspec(align(64)) VUINT32 Log_tbl_L[16][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 AbsMask[8][2]; + __declspec(align(64)) VUINT32 AddB5[8][2]; + __declspec(align(64)) VUINT32 RcpBitMask[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + __declspec(align(64)) VUINT32 poly_coeff0[8][2]; + __declspec(align(64)) VUINT32 Half[8][2]; + __declspec(align(64)) VUINT32 L2H[8][2]; + __declspec(align(64)) VUINT32 L2L[8][2]; + } __svml_datanh_data_internal_avx512; +#endif +__svml_datanh_data_internal_avx512: + /*== Log_tbl_H ==*/ + .quad 0x0000000000000000 + .quad
0x3faf0a30c0100000 + .quad 0x3fbe27076e2a0000 + .quad 0x3fc5ff3070a80000 + .quad 0x3fcc8ff7c79b0000 + .quad 0x3fd1675cabab8000 + .quad 0x3fd4618bc21c8000 + .quad 0x3fd739d7f6bc0000 + .quad 0x3fd9f323ecbf8000 + .quad 0x3fdc8ff7c79a8000 + .quad 0x3fdf128f5faf0000 + .quad 0x3fe0be72e4254000 + .quad 0x3fe1e85f5e704000 + .quad 0x3fe307d7334f0000 + .quad 0x3fe41d8fe8468000 + .quad 0x3fe52a2d265bc000 + /*== Log_tbl_L ==*/ + .align 64 + .quad 0x0000000000000000 + .quad 0x3d662a6617cc9717 + .quad 0x3d6e5cbd3d50fffc + .quad 0xbd6b0b0de3077d7e + .quad 0xbd697794f689f843 + .quad 0x3d630701ce63eab9 + .quad 0xbd609ec17a426426 + .quad 0xbd67fcb18ed9d603 + .quad 0x3d584bf2b68d766f + .quad 0x3d5a21ac25d81ef3 + .quad 0x3d3bb2cd720ec44c + .quad 0xbd657d49676844cc + .quad 0x3d1a07bd8b34be7c + .quad 0x3d60be1fb590a1f5 + .quad 0xbd5aa33736867a17 + .quad 0x3d46abb9df22bc57 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== AbsMask ==*/ + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== AddB5 ==*/ + .align 64 + .quad 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000 + /*== RcpBitMask ==*/ + .align 64 + .quad 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0x3fbc81dd40d38142, 0x3fbc81dd40d38142, 0x3fbc81dd40d38142, 0x3fbc81dd40d38142, 0x3fbc81dd40d38142, 0x3fbc81dd40d38142, 0x3fbc81dd40d38142, 0x3fbc81dd40d38142 + /*== poly_coeff7 ==*/ + .align 64 + .quad 0xbfc0073cb82e8b70, 0xbfc0073cb82e8b70, 0xbfc0073cb82e8b70, 0xbfc0073cb82e8b70, 
0xbfc0073cb82e8b70, 0xbfc0073cb82e8b70, 0xbfc0073cb82e8b70, 0xbfc0073cb82e8b70 + /*== poly_coeff6 ==*/ + .align 64 + .quad 0x3fc2492298ffdae8, 0x3fc2492298ffdae8, 0x3fc2492298ffdae8, 0x3fc2492298ffdae8, 0x3fc2492298ffdae8, 0x3fc2492298ffdae8, 0x3fc2492298ffdae8, 0x3fc2492298ffdae8 + /*== poly_coeff5 ==*/ + .align 64 + .quad 0xbfc55553f871e5c5, 0xbfc55553f871e5c5, 0xbfc55553f871e5c5, 0xbfc55553f871e5c5, 0xbfc55553f871e5c5, 0xbfc55553f871e5c5, 0xbfc55553f871e5c5, 0xbfc55553f871e5c5 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0x3fc9999999cd394a, 0x3fc9999999cd394a, 0x3fc9999999cd394a, 0x3fc9999999cd394a, 0x3fc9999999cd394a, 0x3fc9999999cd394a, 0x3fc9999999cd394a, 0x3fc9999999cd394a + /*== poly_coeff3 ==*/ + .align 64 + .quad 0xbfd00000000c2a01, 0xbfd00000000c2a01, 0xbfd00000000c2a01, 0xbfd00000000c2a01, 0xbfd00000000c2a01, 0xbfd00000000c2a01, 0xbfd00000000c2a01, 0xbfd00000000c2a01 + /*== poly_coeff2 ==*/ + .align 64 + .quad 0x3fd5555555555462, 0x3fd5555555555462, 0x3fd5555555555462, 0x3fd5555555555462, 0x3fd5555555555462, 0x3fd5555555555462, 0x3fd5555555555462, 0x3fd5555555555462 + /*== poly_coeff1 ==*/ + .align 64 + .quad 0xbfdfffffffffffc5, 0xbfdfffffffffffc5, 0xbfdfffffffffffc5, 0xbfdfffffffffffc5, 0xbfdfffffffffffc5, 0xbfdfffffffffffc5, 0xbfdfffffffffffc5, 0xbfdfffffffffffc5 + /*== poly_coeff0 ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== Half ==*/ + .align 64 + .quad 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .quad 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000 + /*== L2L = log(2)_low ==*/ + .align 64 + .quad 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 
0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000 + .align 64 + .type __svml_datanh_data_internal_avx512,@object + .size __svml_datanh_data_internal_avx512,.-__svml_datanh_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core-avx2.S new file mode 100644 index 0000000000..1af3662f65 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized atanhf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVeN16v_atanhf _ZGVeN16v_atanhf_avx2_wrapper +#include "../svml_s_atanhf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core.c new file mode 100644 index 0000000000..4b1190f0eb --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atanhf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVeN16v_atanhf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_atanhf, __GI__ZGVeN16v_atanhf, + __redirect__ZGVeN16v_atanhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core_avx512.S new file mode 100644 index 0000000000..6c5f6a54fa --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf16_core_avx512.S @@ -0,0 +1,393 @@ +/* Function atanhf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.  */ + +/* + * ALGORITHM DESCRIPTION: + * + *   Compute atanh(x) as 0.5 * log((1 + x)/(1 - x)) + *   using a small lookup table that maps to AVX-512 permute instructions + * + *   Special cases: + * + *   atanh(0)  = 0 + *   atanh(+1) = +INF + *   atanh(-1) = -INF + *   atanh(x)  = NaN if |x| > 1, or if x is a NaN or INF + * + */ + +/* Offsets for data table __svml_satanh_data_internal_avx512 + */ +#define Log_tbl_H 0 +#define Log_tbl_L 128 +#define One 256 +#define AbsMask 320 +#define AddB5 384 +#define RcpBitMask 448 +#define poly_coeff3 512 +#define poly_coeff2 576 +#define poly_coeff1 640 +#define poly_coeff0 704 +#define Half 768 +#define L2H 832 +#define L2L 896 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_atanhf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups One+__svml_satanh_data_internal_avx512(%rip), %zmm4 + +/* round reciprocals to 1+5b mantissas */ + vmovups AddB5+__svml_satanh_data_internal_avx512(%rip), %zmm14 + vmovups RcpBitMask+__svml_satanh_data_internal_avx512(%rip), %zmm1 + vmovaps %zmm0, %zmm11 + vandps AbsMask+__svml_satanh_data_internal_avx512(%rip), %zmm11, %zmm6 + +/* 1+y */ + vaddps {rn-sae}, %zmm4, %zmm6, %zmm9 + +/* 1-y */ + vsubps {rn-sae}, %zmm6, %zmm4, %zmm8 + vxorps %zmm6, %zmm11, %zmm10 + +/* Yp_high */ + vsubps {rn-sae}, %zmm4, %zmm9, %zmm2 + +/* -Ym_high */ + vsubps {rn-sae}, %zmm4, %zmm8, %zmm5 + +/* RcpP ~ 1/Yp */ + vrcp14ps %zmm9, %zmm12 + +/* RcpM ~ 1/Ym */ + vrcp14ps %zmm8, %zmm13 + +/* input outside (-1, 1) ?
*/ + vcmpps $21, {sae}, %zmm4, %zmm6, %k0 + vpaddd %zmm14, %zmm12, %zmm15 + vpaddd %zmm14, %zmm13, %zmm0 + +/* Yp_low */ + vsubps {rn-sae}, %zmm2, %zmm6, %zmm3 + vandps %zmm1, %zmm15, %zmm7 + vandps %zmm1, %zmm0, %zmm12 + +/* Ym_low */ + vaddps {rn-sae}, %zmm5, %zmm6, %zmm5 + +/* Reduced argument: Rp = (RcpP*Yp - 1)+RcpP*Yp_low */ + vfmsub213ps {rn-sae}, %zmm4, %zmm7, %zmm9 + +/* Reduced argument: Rm = (RcpM*Ym - 1)+RcpM*Ym_low */ + vfmsub231ps {rn-sae}, %zmm12, %zmm8, %zmm4 + vmovups Log_tbl_L+__svml_satanh_data_internal_avx512(%rip), %zmm8 + vmovups Log_tbl_L+64+__svml_satanh_data_internal_avx512(%rip), %zmm13 + +/* exponents */ + vgetexpps {sae}, %zmm7, %zmm15 + vfmadd231ps {rn-sae}, %zmm7, %zmm3, %zmm9 + +/* Table lookups */ + vmovups __svml_satanh_data_internal_avx512(%rip), %zmm6 + vgetexpps {sae}, %zmm12, %zmm14 + vfnmadd231ps {rn-sae}, %zmm12, %zmm5, %zmm4 + +/* Prepare table index */ + vpsrld $18, %zmm7, %zmm3 + vpsrld $18, %zmm12, %zmm2 + vmovups Log_tbl_H+64+__svml_satanh_data_internal_avx512(%rip), %zmm7 + vmovups poly_coeff1+__svml_satanh_data_internal_avx512(%rip), %zmm12 + +/* Km-Kp */ + vsubps {rn-sae}, %zmm15, %zmm14, %zmm1 + kmovw %k0, %edx + vmovaps %zmm3, %zmm0 + vpermi2ps %zmm13, %zmm8, %zmm3 + vpermt2ps %zmm13, %zmm2, %zmm8 + vpermi2ps %zmm7, %zmm6, %zmm0 + vpermt2ps %zmm7, %zmm2, %zmm6 + vsubps {rn-sae}, %zmm3, %zmm8, %zmm5 + +/* K*L2H + Th */ + vmovups L2H+__svml_satanh_data_internal_avx512(%rip), %zmm2 + +/* K*L2L + Tl */ + vmovups L2L+__svml_satanh_data_internal_avx512(%rip), %zmm3 + +/* polynomials */ + vmovups poly_coeff3+__svml_satanh_data_internal_avx512(%rip), %zmm7 + vmovups poly_coeff0+__svml_satanh_data_internal_avx512(%rip), %zmm13 + +/* table values */ + vsubps {rn-sae}, %zmm0, %zmm6, %zmm0 + vfmadd231ps {rn-sae}, %zmm1, %zmm2, %zmm0 + vfmadd213ps {rn-sae}, %zmm5, %zmm3, %zmm1 + vmovups poly_coeff2+__svml_satanh_data_internal_avx512(%rip), %zmm3 + vmovaps %zmm3, %zmm2 + vfmadd231ps {rn-sae}, %zmm9, %zmm7, %zmm2 + vfmadd231ps 
{rn-sae}, %zmm4, %zmm7, %zmm3 + vfmadd213ps {rn-sae}, %zmm12, %zmm9, %zmm2 + vfmadd213ps {rn-sae}, %zmm12, %zmm4, %zmm3 + vfmadd213ps {rn-sae}, %zmm13, %zmm9, %zmm2 + vfmadd213ps {rn-sae}, %zmm13, %zmm4, %zmm3 + +/* (K*L2L + Tl) + Rp*PolyP */ + vfmadd213ps {rn-sae}, %zmm1, %zmm9, %zmm2 + vorps Half+__svml_satanh_data_internal_avx512(%rip), %zmm10, %zmm9 + +/* (K*L2L + Tl) + Rp*PolyP -Rm*PolyM */ + vfnmadd213ps {rn-sae}, %zmm2, %zmm4, %zmm3 + vaddps {rn-sae}, %zmm3, %zmm0, %zmm4 + vmulps {rn-sae}, %zmm9, %zmm4, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm11 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm11, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + 
btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call atanhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_atanhf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_satanh_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl_H[32][1]; + __declspec(align(64)) VUINT32 Log_tbl_L[32][1]; + __declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 AbsMask[16][1]; + __declspec(align(64)) VUINT32 AddB5[16][1]; + __declspec(align(64)) VUINT32
RcpBitMask[16][1]; + __declspec(align(64)) VUINT32 poly_coeff3[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 poly_coeff1[16][1]; + __declspec(align(64)) VUINT32 poly_coeff0[16][1]; + __declspec(align(64)) VUINT32 Half[16][1]; + __declspec(align(64)) VUINT32 L2H[16][1]; + __declspec(align(64)) VUINT32 L2L[16][1]; + } __svml_satanh_data_internal_avx512; +#endif +__svml_satanh_data_internal_avx512: + /*== Log_tbl_H ==*/ + .long 0x00000000 + .long 0x3cfc0000 + .long 0x3d780000 + .long 0x3db78000 + .long 0x3df10000 + .long 0x3e14c000 + .long 0x3e300000 + .long 0x3e4a8000 + .long 0x3e648000 + .long 0x3e7dc000 + .long 0x3e8b4000 + .long 0x3e974000 + .long 0x3ea30000 + .long 0x3eae8000 + .long 0x3eb9c000 + .long 0x3ec4e000 + .long 0x3ecfa000 + .long 0x3eda2000 + .long 0x3ee48000 + .long 0x3eeea000 + .long 0x3ef8a000 + .long 0x3f013000 + .long 0x3f05f000 + .long 0x3f0aa000 + .long 0x3f0f4000 + .long 0x3f13d000 + .long 0x3f184000 + .long 0x3f1ca000 + .long 0x3f20f000 + .long 0x3f252000 + .long 0x3f295000 + .long 0x3f2d7000 + /*== Log_tbl_L ==*/ + .align 64 + .long 0x00000000 + .long 0x3726c39e + .long 0x38a30c01 + .long 0x37528ae5 + .long 0x38e0edc5 + .long 0xb8ab41f8 + .long 0xb7cf8f58 + .long 0x3896a73d + .long 0xb5838656 + .long 0x380c36af + .long 0xb8235454 + .long 0x3862bae1 + .long 0x38c5e10e + .long 0x38dedfac + .long 0x38ebfb5e + .long 0xb8e63c9f + .long 0xb85c1340 + .long 0x38777bcd + .long 0xb6038656 + .long 0x37d40984 + .long 0xb8b85028 + .long 0xb8ad5a5a + .long 0x3865c84a + .long 0x38c3d2f5 + .long 0x383ebce1 + .long 0xb8a1ed76 + .long 0xb7a332c4 + .long 0xb779654f + .long 0xb8602f73 + .long 0x38f85db0 + .long 0x37b4996f + .long 0xb8bfb3ca + /*== One ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== AbsMask ==*/ + .align 64 + .long 0x7fffffff, 
0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== AddB5 ==*/ + .align 64 + .long 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000 + /*== RcpBitMask ==*/ + .align 64 + .long 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000 + /*== poly_coeff3 ==*/ + .align 64 + .long 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810 + /*== poly_coeff2 ==*/ + .align 64 + .long 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e + /*== poly_coeff1 ==*/ + .align 64 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 + /*== poly_coeff0 ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== Half ==*/ + .align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .long 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 
0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000 + /*== L2L = log(2)_low ==*/ + .align 64 + .long 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4 + .align 64 + .type __svml_satanh_data_internal_avx512,@object + .size __svml_satanh_data_internal_avx512,.-__svml_satanh_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core-sse2.S new file mode 100644 index 0000000000..b750092887 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized atanhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_atanhf _ZGVbN4v_atanhf_sse2 +#include "../svml_s_atanhf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core.c new file mode 100644 index 0000000000..46624c48cd --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atanhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_atanhf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_atanhf, __GI__ZGVbN4v_atanhf, + __redirect__ZGVbN4v_atanhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core_sse4.S new file mode 100644 index 0000000000..77e46cb5b9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf4_core_sse4.S @@ -0,0 +1,361 @@ +/* Function atanhf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.  */ + +/* + * ALGORITHM DESCRIPTION: + * + *   Compute atanh(x) as 0.5 * log((1 + x)/(1 - x)) + * + *   Special cases: + * + *   atanh(0)  = 0 + *   atanh(+1) = +INF + *   atanh(-1) = -INF + *   atanh(x)  = NaN if |x| > 1, or if x is a NaN or INF + * + */ + +/* Offsets for data table __svml_satanh_data_internal + */ +#define SgnMask 0 +#define sOne 16 +#define sPoly 32 +#define iBrkValue 160 +#define iOffExpoMask 176 +#define sHalf 192 +#define sSign 208 +#define sTopMask12 224 +#define TinyRange 240 +#define sLn2 256 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_atanhf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm5 + +/* Load constants including One = 1 */ + movups sOne+__svml_satanh_data_internal(%rip), %xmm4 + movaps %xmm5, %xmm3 + +/* Strip off the sign, so treat X as positive until right at the end */ + movups SgnMask+__svml_satanh_data_internal(%rip), %xmm7 + movaps %xmm4, %xmm8 + andps %xmm5, %xmm7 + movaps %xmm4, %xmm10 + movups sTopMask12+__svml_satanh_data_internal(%rip), %xmm11 + movaps %xmm4, %xmm14 + movaps %xmm11, %xmm9 + +/* + * Compute V = 2 * X trivially, and UHi + U_lo = 1 - X in two pieces, + * the upper part UHi being <= 12 bits long.  Then we have + * atanh(X) = 1/2 * log((1 + X) / (1 - X)) = 1/2 * log1p(V / (UHi + ULo)). + */ + movaps %xmm7, %xmm12 + +/* + * Check whether |X| < 1, in which case we use the main function. + * Otherwise set the rangemask so that the callout will get used. + * Note that this will also use the callout for NaNs since not(NaN < 1).
+ */ + movaps %xmm7, %xmm6 + movaps %xmm7, %xmm2 + cmpnltps %xmm4, %xmm6 + cmpltps TinyRange+__svml_satanh_data_internal(%rip), %xmm2 + mulps %xmm5, %xmm3 + subps %xmm7, %xmm8 + addps %xmm7, %xmm12 + movmskps %xmm6, %edx + subps %xmm8, %xmm10 + addps %xmm5, %xmm3 + subps %xmm7, %xmm10 + andps %xmm8, %xmm9 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * later incorporating L into the reduced argument. + * compute 1+x as high, low parts + */ + movaps %xmm4, %xmm7 + +/* + * Now compute R = 1/(UHi+ULo) * (1 - E) and the error term E + * The first FMR is exact (we force R to 12 bits just in case it + * isn't already, to make absolutely sure), and since E is ~ 2^-12, + * the rounding error in the other one is acceptable. + */ + rcpps %xmm9, %xmm15 + subps %xmm9, %xmm8 + andps %xmm11, %xmm15 + +/* + * Split V as well into upper 12 bits and lower part, so that we can get + * a preliminary quotient estimate without rounding error. + */ + andps %xmm12, %xmm11 + mulps %xmm15, %xmm9 + addps %xmm8, %xmm10 + subps %xmm11, %xmm12 + +/* Hence get initial quotient estimate QHi + QLo = R * VHi + R * VLo */ + mulps %xmm15, %xmm11 + mulps %xmm15, %xmm10 + subps %xmm9, %xmm14 + mulps %xmm12, %xmm15 + subps %xmm10, %xmm14 + +/* Compute D = E + E^2 */ + movaps %xmm14, %xmm13 + movaps %xmm4, %xmm8 + mulps %xmm14, %xmm13 + +/* reduction: compute r,n */ + movdqu iBrkValue+__svml_satanh_data_internal(%rip), %xmm9 + addps %xmm13, %xmm14 + +/* + * Compute R * (VHi + VLo) * (1 + E + E^2) + * = R * (VHi + VLo) * (1 + D) + * = QHi + (QHi * D + QLo + QLo * D) + */ + movaps %xmm14, %xmm0 + mulps %xmm15, %xmm14 + mulps %xmm11, %xmm0 + addps %xmm14, %xmm15 + movdqu iOffExpoMask+__svml_satanh_data_internal(%rip), %xmm12 + movaps %xmm4, %xmm14 + +/* Record the sign for eventual reincorporation. 
*/ + movups sSign+__svml_satanh_data_internal(%rip), %xmm1 + addps %xmm15, %xmm0 + +/* + * Now finally accumulate the high and low parts of the + * argument to log1p, H + L, with a final compensated summation. + */ + movaps %xmm0, %xmm6 + andps %xmm5, %xmm1 + +/* Or the sign bit in with the tiny result to handle atanh(-0) correctly */ + orps %xmm1, %xmm3 + addps %xmm11, %xmm6 + maxps %xmm6, %xmm7 + minps %xmm6, %xmm8 + subps %xmm6, %xmm11 + movaps %xmm7, %xmm10 + andps %xmm2, %xmm3 + addps %xmm8, %xmm10 + addps %xmm11, %xmm0 + subps %xmm10, %xmm7 + psubd %xmm9, %xmm10 + addps %xmm7, %xmm8 + pand %xmm10, %xmm12 + psrad $23, %xmm10 + cvtdq2ps %xmm10, %xmm13 + addps %xmm8, %xmm0 + +/* final reconstruction */ + mulps sLn2+__svml_satanh_data_internal(%rip), %xmm13 + pslld $23, %xmm10 + paddd %xmm9, %xmm12 + psubd %xmm10, %xmm14 + +/* polynomial evaluation */ + subps %xmm4, %xmm12 + mulps %xmm0, %xmm14 + movups sPoly+112+__svml_satanh_data_internal(%rip), %xmm0 + addps %xmm12, %xmm14 + mulps %xmm14, %xmm0 + +/* Finally, halve the result and reincorporate the sign */ + movups sHalf+__svml_satanh_data_internal(%rip), %xmm4 + pxor %xmm1, %xmm4 + addps sPoly+96+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + addps sPoly+80+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + addps sPoly+64+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + addps sPoly+48+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + addps sPoly+32+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + addps sPoly+16+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + addps sPoly+__svml_satanh_data_internal(%rip), %xmm0 + mulps %xmm14, %xmm0 + mulps %xmm14, %xmm0 + addps %xmm0, %xmm14 + movaps %xmm2, %xmm0 + addps %xmm13, %xmm14 + mulps %xmm14, %xmm4 + andnps %xmm4, %xmm0 + orps %xmm3, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx 
xmm0 xmm5 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm5, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call atanhf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_atanhf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_satanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 SgnMask[4][1]; + __declspec(align(16)) VUINT32 sOne[4][1]; + __declspec(align(16)) VUINT32 sPoly[8][4][1]; + __declspec(align(16)) VUINT32 iBrkValue[4][1]; + __declspec(align(16)) VUINT32 iOffExpoMask[4][1]; + __declspec(align(16)) VUINT32 sHalf[4][1]; + __declspec(align(16)) VUINT32 sSign[4][1]; +
__declspec(align(16)) VUINT32 sTopMask12[4][1]; + __declspec(align(16)) VUINT32 TinyRange[4][1]; + __declspec(align(16)) VUINT32 sLn2[4][1]; +} __svml_satanh_data_internal; +#endif +__svml_satanh_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 16 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 16 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iBrkValue = SP 2/3 ==*/ + .align 16 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sHalf ==*/ + .align 16 + .long 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000 + /*== sSign ==*/ + .align 16 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 + /*== sTopMask12 ==*/ + .align 16 + .long 0xFFFFF000, 0xFFFFF000, 0xFFFFF000, 0xFFFFF000 + /*== TinyRange ==*/ + .align 16 + .long 0x0C000000, 0x0C000000, 0x0C000000, 0x0C000000 + /*== sLn2 = SP ln(2) ==*/ + .align 16 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 16 + .type __svml_satanh_data_internal,@object + .size __svml_satanh_data_internal,.-__svml_satanh_data_internal diff --git 
a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core-sse.S new file mode 100644 index 0000000000..b293bd5b41 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized atanhf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVdN8v_atanhf _ZGVdN8v_atanhf_sse_wrapper +#include "../svml_s_atanhf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core.c new file mode 100644 index 0000000000..3df8d66c94 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized atanhf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN8v_atanhf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_atanhf, __GI__ZGVdN8v_atanhf, + __redirect__ZGVdN8v_atanhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core_avx2.S new file mode 100644 index 0000000000..00225207a8 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_atanhf8_core_avx2.S @@ -0,0 +1,335 @@ +/* Function atanhf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute atanh(x) as 0.5 * log((1 + x)/(1 - x)) + * + * Special cases: + * + * atanh(0) = 0 + * atanh(+1) = +INF + * atanh(-1) = -INF + * atanh(x) = NaN if |x| > 1, or if x is a NaN or INF + * + */ + +/* Offsets for data table __svml_satanh_data_internal + */ +#define SgnMask 0 +#define sOne 32 +#define sPoly 64 +#define iBrkValue 320 +#define iOffExpoMask 352 +#define sHalf 384 +#define sSign 416 +#define sTopMask12 448 +#define TinyRange 480 +#define sLn2 512 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_atanhf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* Load constants including One = 1 */ + vmovups sOne+__svml_satanh_data_internal(%rip), %ymm5 + vmovups sTopMask12+__svml_satanh_data_internal(%rip), %ymm13 + vmovaps %ymm0, %ymm6 + +/* Strip off the sign, so treat X as positive until right at the end */ + vandps SgnMask+__svml_satanh_data_internal(%rip), %ymm6, %ymm10 + vsubps %ymm10, %ymm5, %ymm1 + +/* + * Compute V = 2 * X trivially, and UHi + U_lo = 1 - X in two pieces, + * the upper part UHi being <= 12 bits long. Then we have + * atanh(X) = 1/2 * log((1 + X) / (1 - X)) = 1/2 * log1p(V / (UHi + ULo)). + */ + vaddps %ymm10, %ymm10, %ymm14 + +/* + * Check whether |X| < 1, in which case we use the main function. + * Otherwise set the rangemask so that the callout will get used. + * Note that this will also use the callout for NaNs since not(NaN < 1).
+ */ + vcmpnlt_uqps %ymm5, %ymm10, %ymm7 + vsubps %ymm1, %ymm5, %ymm9 + vcmplt_oqps TinyRange+__svml_satanh_data_internal(%rip), %ymm10, %ymm4 + vrcpps %ymm1, %ymm11 + vsubps %ymm10, %ymm9, %ymm12 + vandps %ymm13, %ymm11, %ymm0 + +/* No need to split sU when FMA is available */ + vfnmadd213ps %ymm5, %ymm0, %ymm1 + vmovaps %ymm6, %ymm8 + vfmadd213ps %ymm6, %ymm6, %ymm8 + vfnmadd231ps %ymm0, %ymm12, %ymm1 + +/* + * Split V as well into upper 12 bits and lower part, so that we can get + * a preliminary quotient estimate without rounding error. + */ + vandps %ymm13, %ymm14, %ymm15 + vmovmskps %ymm7, %edx + vsubps %ymm15, %ymm14, %ymm7 + +/* Hence get initial quotient estimate QHi + QLo = R * VHi + R * VLo */ + vmulps %ymm15, %ymm0, %ymm10 + +/* Compute D = E + E^2 */ + vfmadd213ps %ymm1, %ymm1, %ymm1 + +/* Record the sign for eventual reincorporation. */ + vandps sSign+__svml_satanh_data_internal(%rip), %ymm6, %ymm3 + +/* Or the sign bit in with the tiny result to handle atanh(-0) correctly */ + vorps %ymm3, %ymm8, %ymm2 + vmulps %ymm7, %ymm0, %ymm8 + +/* + * Compute R * (VHi + VLo) * (1 + E + E^2) + * = R * (VHi + VLo) * (1 + D) + * = QHi + (QHi * D + QLo + QLo * D) + */ + vmulps %ymm1, %ymm10, %ymm9 + vfmadd213ps %ymm8, %ymm8, %ymm1 + vaddps %ymm1, %ymm9, %ymm1 + +/* reduction: compute r,n */ + vmovups iBrkValue+__svml_satanh_data_internal(%rip), %ymm9 + +/* + * Now finally accumulate the high and low parts of the + * argument to log1p, H + L, with a final compensated summation. + */ + vaddps %ymm1, %ymm10, %ymm12 + vsubps %ymm12, %ymm10, %ymm11 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * later incorporating L into the reduced argument. 
+ * compute 1+x as high, low parts + */ + vmaxps %ymm12, %ymm5, %ymm13 + vminps %ymm12, %ymm5, %ymm14 + vaddps %ymm11, %ymm1, %ymm0 + vaddps %ymm14, %ymm13, %ymm1 + vpsubd %ymm9, %ymm1, %ymm7 + vsubps %ymm1, %ymm13, %ymm15 + vpsrad $23, %ymm7, %ymm10 + vpand iOffExpoMask+__svml_satanh_data_internal(%rip), %ymm7, %ymm8 + vaddps %ymm15, %ymm14, %ymm13 + vpslld $23, %ymm10, %ymm11 + vpaddd %ymm9, %ymm8, %ymm15 + vaddps %ymm13, %ymm0, %ymm14 + vcvtdq2ps %ymm10, %ymm0 + vpsubd %ymm11, %ymm5, %ymm12 + +/* polynomial evaluation */ + vsubps %ymm5, %ymm15, %ymm5 + vmulps %ymm14, %ymm12, %ymm1 + vaddps %ymm5, %ymm1, %ymm5 + vmovups sPoly+224+__svml_satanh_data_internal(%rip), %ymm1 + vfmadd213ps sPoly+192+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213ps sPoly+160+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213ps sPoly+128+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213ps sPoly+96+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213ps sPoly+64+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213ps sPoly+32+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213ps sPoly+__svml_satanh_data_internal(%rip), %ymm5, %ymm1 + vmulps %ymm1, %ymm5, %ymm7 + vfmadd213ps %ymm5, %ymm5, %ymm7 + +/* final reconstruction */ + vfmadd132ps sLn2+__svml_satanh_data_internal(%rip), %ymm7, %ymm0 + +/* Finally, halve the result and reincorporate the sign */ + vxorps sHalf+__svml_satanh_data_internal(%rip), %ymm3, %ymm3 + vmulps %ymm0, %ymm3, %ymm0 + vblendvps %ymm4, %ymm2, %ymm0, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm6 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm6, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 
r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) 
(DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call atanhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_atanhf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_satanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 SgnMask[8][1]; + __declspec(align(32)) VUINT32 sOne[8][1]; + __declspec(align(32)) VUINT32 sPoly[8][8][1]; + __declspec(align(32)) VUINT32 iBrkValue[8][1]; + __declspec(align(32)) VUINT32 iOffExpoMask[8][1]; + __declspec(align(32)) VUINT32 sHalf[8][1]; + __declspec(align(32)) VUINT32 sSign[8][1]; + __declspec(align(32)) VUINT32 sTopMask12[8][1]; + __declspec(align(32)) VUINT32 TinyRange[8][1]; + __declspec(align(32)) VUINT32 sLn2[8][1]; +} __svml_satanh_data_internal; +#endif +__svml_satanh_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 32 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /*
-2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iBrkValue = SP 2/3 ==*/ + .align 32 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 32 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sHalf ==*/ + .align 32 + .long 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000 + /*== sSign ==*/ + .align 32 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 + /*== sTopMask12 ==*/ + .align 32 + .long 0xFFFFF000, 0xFFFFF000, 0xFFFFF000, 0xFFFFF000, 0xFFFFF000, 0xFFFFF000, 0xFFFFF000, 0xFFFFF000 + /*== TinyRange ==*/ + .align 32 + .long 0x0C000000, 0x0C000000, 0x0C000000, 0x0C000000, 0x0C000000, 0x0C000000, 0x0C000000, 0x0C000000 + /*== sLn2 = SP ln(2) ==*/ + .align 32 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 32 + .type __svml_satanh_data_internal,@object + .size __svml_satanh_data_internal,.-__svml_satanh_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_atanh2_core.S b/sysdeps/x86_64/fpu/svml_d_atanh2_core.S new file mode 100644 index 
0000000000..36f549ddd9 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atanh2_core.S @@ -0,0 +1,29 @@ +/* Function atanh vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_atanh) +WRAPPER_IMPL_SSE2 atanh +END (_ZGVbN2v_atanh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_atanh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_atanh4_core.S b/sysdeps/x86_64/fpu/svml_d_atanh4_core.S new file mode 100644 index 0000000000..6d6d11e85e --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atanh4_core.S @@ -0,0 +1,29 @@ +/* Function atanh vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details.
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_atanh) +WRAPPER_IMPL_AVX _ZGVbN2v_atanh +END (_ZGVdN4v_atanh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_atanh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_atanh4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_atanh4_core_avx.S new file mode 100644 index 0000000000..b4cfa275c8 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atanh4_core_avx.S @@ -0,0 +1,25 @@ +/* Function atanh vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_atanh) +WRAPPER_IMPL_AVX _ZGVbN2v_atanh +END (_ZGVcN4v_atanh) diff --git a/sysdeps/x86_64/fpu/svml_d_atanh8_core.S b/sysdeps/x86_64/fpu/svml_d_atanh8_core.S new file mode 100644 index 0000000000..b31a6a72a1 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_atanh8_core.S @@ -0,0 +1,25 @@ +/* Function atanh vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_atanh) +WRAPPER_IMPL_AVX512 _ZGVdN4v_atanh +END (_ZGVeN8v_atanh) diff --git a/sysdeps/x86_64/fpu/svml_s_atanhf16_core.S b/sysdeps/x86_64/fpu/svml_s_atanhf16_core.S new file mode 100644 index 0000000000..2ea61888e7 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atanhf16_core.S @@ -0,0 +1,25 @@ +/* Function atanhf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_atanhf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_atanhf +END (_ZGVeN16v_atanhf) diff --git a/sysdeps/x86_64/fpu/svml_s_atanhf4_core.S b/sysdeps/x86_64/fpu/svml_s_atanhf4_core.S new file mode 100644 index 0000000000..6904cc388a --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atanhf4_core.S @@ -0,0 +1,29 @@ +/* Function atanhf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_atanhf) +WRAPPER_IMPL_SSE2 atanhf +END (_ZGVbN4v_atanhf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_atanhf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_atanhf8_core.S b/sysdeps/x86_64/fpu/svml_s_atanhf8_core.S new file mode 100644 index 0000000000..31d695fb5d --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atanhf8_core.S @@ -0,0 +1,29 @@ +/* Function atanhf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library.
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_atanhf) +WRAPPER_IMPL_AVX _ZGVbN4v_atanhf +END (_ZGVdN8v_atanhf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_atanhf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_atanhf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_atanhf8_core_avx.S new file mode 100644 index 0000000000..6c24eaf45c --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_atanhf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function atanhf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_atanhf) +WRAPPER_IMPL_AVX _ZGVbN4v_atanhf +END (_ZGVcN8v_atanhf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx.c new file mode 100644 index 0000000000..0bdeec7851 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-atanh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx2.c new file mode 100644 index 0000000000..0bdeec7851 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-atanh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx512f.c new file mode 100644 index 0000000000..0bdeec7851 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atanh-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-atanh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-atanh.c b/sysdeps/x86_64/fpu/test-double-libmvec-atanh.c new file mode 100644 index 0000000000..41dd8e7af3 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-atanh.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC atanh +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 38359b05e3..04a4fe654b 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVbN2vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVbN2v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVbN2v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVbN2v_log1p) +VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVbN2v_atanh) #define VEC_INT_TYPE __m128i diff --git
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 17701e7731..f9ac2fad5d 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVdN4vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVdN4v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVdN4v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVdN4v_log1p) +VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVdN4v_atanh) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index bba62b2446..185801fa82 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVcN4vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVcN4v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVcN4v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVcN4v_log1p) +VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVcN4v_atanh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 8a04e13a07..1cc8aaecbf 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2), _ZGVeN8vv_atan2) VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVeN8v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVeN8v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVeN8v_log1p) +VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVeN8v_atanh) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx.c new file mode 100644 index 0000000000..6f89ae70f2 --- /dev/null +++ 
b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atanhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx2.c new file mode 100644 index 0000000000..6f89ae70f2 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atanhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx512f.c new file mode 100644 index 0000000000..6f89ae70f2 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-atanhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-atanhf.c b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf.c new file mode 100644 index 0000000000..33a022adb8 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-atanhf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC atanhf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 706f52c618..b5d76d80e0 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVeN16vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVeN16v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVeN16v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVeN16v_log1pf) +VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVeN16v_atanhf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index ceace4c53a..c1df6a03c1 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVbN4vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), 
_ZGVbN4v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVbN4v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVbN4v_log1pf) +VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVbN4v_atanhf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 06a4753409..f4c646683f 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVdN8vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVdN8v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVdN8v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVdN8v_log1pf) +VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVdN8v_atanhf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index a87e5298e0..a6acd3ffca 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -41,6 +41,7 @@ VECTOR_WRAPPER_ff (WRAPPER_NAME (atan2f), _ZGVcN8vv_atan2f) VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVcN8v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVcN8v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVcN8v_log1pf) +VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVcN8v_atanhf) #define VEC_INT_TYPE __m128i From patchwork Tue Dec 28 20:11:27 2021 X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573819 To: libc-alpha@sourceware.org Subject: [PATCH v4 15/18] x86-64: Add vector acosh/acoshf implementation to libmvec Date: Tue, 28 Dec 2021 12:11:27 -0800 Message-Id: <20211228201130.737370-16-skpgkp2@gmail.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> From: Sunil Pandey Reply-To: Sunil K Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com Implement vectorized acosh/acoshf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector acosh/acoshf with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_acosh2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_acosh2_core.c | 27 + .../fpu/multiarch/svml_d_acosh2_core_sse4.S | 1466 ++++++++++++++++ .../fpu/multiarch/svml_d_acosh4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_acosh4_core.c | 27 + .../fpu/multiarch/svml_d_acosh4_core_avx2.S | 1533 +++++++++++++++++ .../fpu/multiarch/svml_d_acosh8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_acosh8_core.c | 27 + .../fpu/multiarch/svml_d_acosh8_core_avx512.S | 480 ++++++ .../fpu/multiarch/svml_s_acoshf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_acoshf16_core.c | 28 + .../multiarch/svml_s_acoshf16_core_avx512.S | 449 +++++ .../fpu/multiarch/svml_s_acoshf4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_acoshf4_core.c | 28 + .../fpu/multiarch/svml_s_acoshf4_core_sse4.S | 389 +++++ .../fpu/multiarch/svml_s_acoshf8_core-sse.S | 20 + .../fpu/multiarch/svml_s_acoshf8_core.c | 28 + .../fpu/multiarch/svml_s_acoshf8_core_avx2.S | 370 ++++ sysdeps/x86_64/fpu/svml_d_acosh2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_acosh4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_acosh4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_acosh8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_acoshf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_acoshf4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_acoshf8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_acoshf8_core_avx.S | 25 + .../fpu/test-double-libmvec-acosh-avx.c | 1 + .../fpu/test-double-libmvec-acosh-avx2.c | 1 + .../fpu/test-double-libmvec-acosh-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-acosh.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-acoshf-avx.c | 1 + .../fpu/test-float-libmvec-acoshf-avx2.c | 1 + .../fpu/test-float-libmvec-acoshf-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-acoshf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 5259 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_acosh2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_acosh4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_acosh4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_acosh8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_acoshf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_acoshf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_acoshf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_acoshf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-acosh.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-acoshf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index bb7380a446..b17bf78cd9 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -263,4 +263,15 @@ #define __DECL_SIMD_atanhf32x #define __DECL_SIMD_atanhf64x #define __DECL_SIMD_atanhf128x + +#define __DECL_SIMD_acosh +#define __DECL_SIMD_acoshf +#define __DECL_SIMD_acoshl +#define __DECL_SIMD_acoshf16 +#define __DECL_SIMD_acoshf32 +#define __DECL_SIMD_acoshf64 +#define __DECL_SIMD_acoshf128 +#define __DECL_SIMD_acoshf32x +#define __DECL_SIMD_acoshf64x +#define __DECL_SIMD_acoshf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 04dd9c5d1b..bc37973c41 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -82,7 +82,7 @@ __MATHDECL_VEC (void,sincos,, #if defined __USE_XOPEN_EXTENDED || defined __USE_ISOC99 /* Hyperbolic arc cosine of X. */ -__MATHCALL (acosh,, (_Mdouble_ __x)); +__MATHCALL_VEC (acosh,, (_Mdouble_ __x)); /* Hyperbolic arc sine of X. */ __MATHCALL (asinh,, (_Mdouble_ __x)); /* Hyperbolic arc tangent of X. 
*/ diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 2d389912b1..e9d6ade70a 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -47,6 +47,7 @@ GLIBC_2.22 _ZGVeN8v_sin F GLIBC_2.22 _ZGVeN8vv_pow F GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F +GLIBC_2.35 _ZGVbN2v_acosh F GLIBC_2.35 _ZGVbN2v_asin F GLIBC_2.35 _ZGVbN2v_atan F GLIBC_2.35 _ZGVbN2v_atanh F @@ -62,6 +63,7 @@ GLIBC_2.35 _ZGVbN2v_sinh F GLIBC_2.35 _ZGVbN2vv_atan2 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F +GLIBC_2.35 _ZGVbN4v_acoshf F GLIBC_2.35 _ZGVbN4v_asinf F GLIBC_2.35 _ZGVbN4v_atanf F GLIBC_2.35 _ZGVbN4v_atanhf F @@ -77,6 +79,7 @@ GLIBC_2.35 _ZGVbN4v_sinhf F GLIBC_2.35 _ZGVbN4vv_atan2f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F +GLIBC_2.35 _ZGVcN4v_acosh F GLIBC_2.35 _ZGVcN4v_asin F GLIBC_2.35 _ZGVcN4v_atan F GLIBC_2.35 _ZGVcN4v_atanh F @@ -92,6 +95,7 @@ GLIBC_2.35 _ZGVcN4v_sinh F GLIBC_2.35 _ZGVcN4vv_atan2 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F +GLIBC_2.35 _ZGVcN8v_acoshf F GLIBC_2.35 _ZGVcN8v_asinf F GLIBC_2.35 _ZGVcN8v_atanf F GLIBC_2.35 _ZGVcN8v_atanhf F @@ -107,6 +111,7 @@ GLIBC_2.35 _ZGVcN8v_sinhf F GLIBC_2.35 _ZGVcN8vv_atan2f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F +GLIBC_2.35 _ZGVdN4v_acosh F GLIBC_2.35 _ZGVdN4v_asin F GLIBC_2.35 _ZGVdN4v_atan F GLIBC_2.35 _ZGVdN4v_atanh F @@ -122,6 +127,7 @@ GLIBC_2.35 _ZGVdN4v_sinh F GLIBC_2.35 _ZGVdN4vv_atan2 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F +GLIBC_2.35 _ZGVdN8v_acoshf F GLIBC_2.35 _ZGVdN8v_asinf F GLIBC_2.35 _ZGVdN8v_atanf F GLIBC_2.35 _ZGVdN8v_atanhf F @@ -137,6 +143,7 @@ GLIBC_2.35 _ZGVdN8v_sinhf F GLIBC_2.35 _ZGVdN8vv_atan2f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F +GLIBC_2.35 _ZGVeN16v_acoshf F GLIBC_2.35 _ZGVeN16v_asinf F GLIBC_2.35 _ZGVeN16v_atanf F GLIBC_2.35 _ZGVeN16v_atanhf F @@ -152,6 +159,7 @@ 
GLIBC_2.35 _ZGVeN16v_sinhf F GLIBC_2.35 _ZGVeN16vv_atan2f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F +GLIBC_2.35 _ZGVeN8v_acosh F GLIBC_2.35 _ZGVeN8v_asin F GLIBC_2.35 _ZGVeN8v_atan F GLIBC_2.35 _ZGVeN8v_atanh F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 4937b6811f..4ad12a33e5 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -118,6 +118,10 @@ # define __DECL_SIMD_atanh __DECL_SIMD_x86_64 # undef __DECL_SIMD_atanhf # define __DECL_SIMD_atanhf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_acosh +# define __DECL_SIMD_acosh __DECL_SIMD_x86_64 +# undef __DECL_SIMD_acoshf +# define __DECL_SIMD_acoshf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index da39c08ba9..503547d3e4 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -58,6 +58,8 @@ !GCC$ builtin (log1pf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (atanh) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (atanhf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (acosh) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (acoshf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -101,3 +103,5 @@ !GCC$ builtin (log1pf) attributes simd (notinbranch) if('x32') !GCC$ builtin (atanh) attributes simd (notinbranch) if('x32') !GCC$ builtin (atanhf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (acosh) attributes simd (notinbranch) if('x32') +!GCC$ builtin (acoshf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index de87544259..7b90b3d049 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -23,6 +23,7 @@ 
postclean-generated += libmvec.mk # Define for both math and mathvec directories. libmvec-funcs = \ acos \ + acosh \ asin \ atan \ atan2 \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index df0ea83711..fd5e5923a1 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -15,6 +15,7 @@ libmvec { } GLIBC_2.35 { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; + _ZGVbN2v_acosh; _ZGVcN4v_acosh; _ZGVdN4v_acosh; _ZGVeN8v_acosh; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; _ZGVbN2v_atanh; _ZGVcN4v_atanh; _ZGVdN4v_atanh; _ZGVeN8v_atanh; @@ -30,6 +31,7 @@ libmvec { _ZGVbN2vv_atan2; _ZGVcN4vv_atan2; _ZGVdN4vv_atan2; _ZGVeN8vv_atan2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; + _ZGVbN4v_acoshf; _ZGVcN8v_acoshf; _ZGVdN8v_acoshf; _ZGVeN16v_acoshf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; _ZGVbN4v_atanhf; _ZGVcN8v_atanhf; _ZGVdN8v_atanhf; _ZGVeN16v_atanhf; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 09a46190b6..b2aa8fc56e 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -69,6 +69,26 @@ float: 2 float128: 3 ldouble: 3 +Function: "acosh_vlen16": +float: 1 + +Function: "acosh_vlen2": +double: 2 + +Function: "acosh_vlen4": +double: 2 +float: 1 + +Function: "acosh_vlen4_avx2": +double: 2 + +Function: "acosh_vlen8": +double: 1 +float: 1 + +Function: "acosh_vlen8_avx2": +float: 2 + Function: "asin": double: 1 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core-sse2.S new file mode 100644 index 0000000000..28620a03a9 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version 
of vectorized acosh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVbN2v_acosh _ZGVbN2v_acosh_sse2 +#include "../svml_d_acosh2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core.c new file mode 100644 index 0000000000..8a41507326 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized acosh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVbN2v_acosh +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_acosh, __GI__ZGVbN2v_acosh, __redirect__ZGVbN2v_acosh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core_sse4.S new file mode 100644 index 0000000000..687e3d1a8b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh2_core_sse4.S @@ -0,0 +1,1466 @@ +/* Function acosh vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute acosh(x) as log(x + sqrt(x*x - 1)) + * + * Special cases: + * + * acosh(NaN) = quiet NaN, and raise invalid exception + * acosh(-INF) = NaN + * acosh(+INF) = +INF + * acosh(x) = NaN if x < 1 + * acosh(1) = +0 + * + */ + +/* Offsets for data table __svml_dacosh_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8208 +#define poly_coeff 12320 +#define ExpMask 12384 +#define Two10 12400 +#define MinLog1p 12416 +#define MaxLog1p 12432 +#define One 12448 +#define SgnMask 12464 +#define XThreshold 12480 +#define XhMask 12496 +#define Threshold 12512 +#define Bias 12528 +#define Bias1 12544 +#define ExpMask0 12560 +#define ExpMask2 12576 +#define L2 12592 +#define dBigThreshold 12608 +#define dLargestFinite 12624 +#define dThirtyOne 12640 +#define XScale 12656 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_acosh_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + movaps %xmm0, %xmm7 + +/* Load the constant 1 and possibly other stuff */ + movups One+__svml_dacosh_data_internal(%rip), %xmm6 + +/* Compute U = X - 1 and V = X + 1, naively first. */ + movaps %xmm7, %xmm11 + movaps %xmm6, %xmm10 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. + * compute 1+x as high, low parts + */ + movaps %xmm6, %xmm14 + subpd %xmm6, %xmm11 + addpd %xmm7, %xmm10 + +/* For low-accuracy versions, naivety is harmless */ + mulpd %xmm11, %xmm10 + +/* dH = [X + sqrt(X^2 - 1)] - 1 */ + sqrtpd %xmm10, %xmm13 + addpd %xmm11, %xmm13 + maxpd %xmm13, %xmm14 + movaps %xmm6, %xmm4 + +/* + * The following computation can go wrong for very large X, e.g. + * the X^2 - 1 = U * V can overflow. But for large X we have + * acosh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent.
+ * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + movaps %xmm7, %xmm5 + minpd %xmm13, %xmm4 + cmpltpd dBigThreshold+__svml_dacosh_data_internal(%rip), %xmm5 + movups SgnMask+__svml_dacosh_data_internal(%rip), %xmm12 + movaps %xmm14, %xmm0 + +/* Now multiplex to the case X = 2^-30 * input, Xl = dL = 0 in the "big" case. */ + movups XScale+__svml_dacosh_data_internal(%rip), %xmm15 + andps %xmm12, %xmm13 + mulpd %xmm7, %xmm15 + cmpltpd XThreshold+__svml_dacosh_data_internal(%rip), %xmm13 + addpd %xmm4, %xmm0 + orps XhMask+__svml_dacosh_data_internal(%rip), %xmm13 + movaps %xmm5, %xmm3 + andps %xmm13, %xmm0 + andnps %xmm15, %xmm3 + subpd %xmm0, %xmm14 + andps %xmm5, %xmm0 + +/* + * Check that 1 < X < +inf; otherwise go to the callout function. + * We need the callout for X = 1 to avoid division by zero below. + * This test ensures that callout handles NaN and either infinity. + */ + movaps %xmm7, %xmm9 + +/* Now resume the main code. 
*/ + movups ExpMask+__svml_dacosh_data_internal(%rip), %xmm1 + orps %xmm0, %xmm3 + +/* preserve mantissa, set input exponent to 2^(-10) */ + andps %xmm3, %xmm1 + movaps %xmm6, %xmm8 + orps Two10+__svml_dacosh_data_internal(%rip), %xmm1 + +/* exponent bits */ + movaps %xmm3, %xmm11 + +/* reciprocal approximation good to at least 11 bits */ + cvtpd2ps %xmm1, %xmm2 + cmpnlepd dLargestFinite+__svml_dacosh_data_internal(%rip), %xmm9 + cmpnltpd %xmm7, %xmm8 + addpd %xmm14, %xmm4 + movlhps %xmm2, %xmm2 + orps %xmm8, %xmm9 + rcpps %xmm2, %xmm8 + movmskpd %xmm9, %edx + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + movups .FLT_20(%rip), %xmm10 + andps %xmm5, %xmm4 + +/* exponent of X needed to scale Xl */ + movdqu ExpMask0+__svml_dacosh_data_internal(%rip), %xmm9 + psrlq $20, %xmm11 + cvtps2pd %xmm8, %xmm1 + addpd %xmm10, %xmm1 + subpd %xmm10, %xmm1 + +/* 2^ (-10-exp(X) ) */ + movdqu ExpMask2+__svml_dacosh_data_internal(%rip), %xmm2 + pand %xmm3, %xmm9 + psubq %xmm9, %xmm2 + +/* scale DblRcp */ + mulpd %xmm1, %xmm2 + +/* argument reduction */ + mulpd %xmm2, %xmm3 + mulpd %xmm2, %xmm4 + subpd %xmm6, %xmm3 + movaps %xmm3, %xmm2 + movaps %xmm5, %xmm0 + addpd %xmm4, %xmm2 + pshufd $221, %xmm11, %xmm12 + movaps %xmm2, %xmm6 + +/* biased exponent in DP format */ + cvtdq2pd %xmm12, %xmm14 + subpd %xmm3, %xmm6 + +/* polynomial */ + movups poly_coeff+__svml_dacosh_data_internal(%rip), %xmm3 + lea -4218864+__svml_dacosh_data_internal(%rip), %rsi + mulpd %xmm2, %xmm3 + subpd %xmm6, %xmm4 + addpd poly_coeff+16+__svml_dacosh_data_internal(%rip), %xmm3 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + movups dThirtyOne+__svml_dacosh_data_internal(%rip), %xmm13 + +/* exponent*log(2.0) */ + movups Threshold+__svml_dacosh_data_internal(%rip), %xmm8 + addpd %xmm14, %xmm13 + cmpltpd %xmm1, %xmm8 + andps %xmm5, %xmm14 + +/* + * prepare table index + * table lookup + */ + movaps %xmm1, %xmm5 + movaps %xmm2, %xmm1 + andnps %xmm13, %xmm0 + 
mulpd %xmm2, %xmm1 + movups poly_coeff+32+__svml_dacosh_data_internal(%rip), %xmm6 + psrlq $40, %xmm5 + mulpd %xmm2, %xmm6 + mulpd %xmm1, %xmm3 + addpd poly_coeff+48+__svml_dacosh_data_internal(%rip), %xmm6 + movd %xmm5, %eax + andps Bias+__svml_dacosh_data_internal(%rip), %xmm8 + orps %xmm14, %xmm0 + addpd %xmm3, %xmm6 + +/* + * reconstruction + * VQFMA( D, R, P, R2, R ); + */ + mulpd %xmm6, %xmm1 + addpd %xmm1, %xmm4 + orps Bias1+__svml_dacosh_data_internal(%rip), %xmm8 + pshufd $2, %xmm5, %xmm15 + subpd %xmm8, %xmm0 + addpd %xmm4, %xmm2 + movd %xmm15, %ecx + mulpd L2+__svml_dacosh_data_internal(%rip), %xmm0 + movslq %eax, %rax + movslq %ecx, %rcx + movsd (%rsi,%rax), %xmm9 + movhpd (%rsi,%rcx), %xmm9 + addpd %xmm2, %xmm9 + addpd %xmm9, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm7 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm7, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 
0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call acosh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_acosh_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dacosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(16)) VUINT32 Log_LA_table[(1<<9)+1][2]; +
__declspec(align(16)) VUINT32 poly_coeff[4][2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 Two10[2][2]; + __declspec(align(16)) VUINT32 MinLog1p[2][2]; + __declspec(align(16)) VUINT32 MaxLog1p[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 SgnMask[2][2]; + __declspec(align(16)) VUINT32 XThreshold[2][2]; + __declspec(align(16)) VUINT32 XhMask[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 Bias[2][2]; + __declspec(align(16)) VUINT32 Bias1[2][2]; + __declspec(align(16)) VUINT32 ExpMask0[2][2]; + __declspec(align(16)) VUINT32 ExpMask2[2][2]; + __declspec(align(16)) VUINT32 L2[2][2]; + __declspec(align(16)) VUINT32 dBigThreshold[2][2]; + __declspec(align(16)) VUINT32 dLargestFinite[2][2]; + __declspec(align(16)) VUINT32 dThirtyOne[2][2]; + __declspec(align(16)) VUINT32 XScale[2][2]; +} __svml_dacosh_data_internal; +#endif +__svml_dacosh_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 
0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 
0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + .quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + 
.quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 
0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + .quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + 
.quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 
0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + .quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + 
.quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 
0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + .quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + 
.quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 
0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + .quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + 
.quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 
0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 16 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + 
.quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + 
.quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + 
.quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + 
.quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + 
.quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + 
.quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + 
.quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 0x3f8010157588de71 + .quad 0x3f7c189cbb0e27fb + .quad 0x3f78121214586b54 + .quad 0x3f740c8a747878e2 + .quad 0x3f70080559588b35 + .quad 0x3f680904828985c0 + .quad 0x3f60040155d5889e + .quad 0x3f50020055655889 + .quad 0x0000000000000000 + /*== poly_coeff[4] ==*/ + .align 16 + .quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */ + .quad 0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */ + .quad 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */ + .quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */ + /*== ExpMask ==*/ + .align 16 + .quad 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 16 + .quad 0x3f50000000000000, 0x3f50000000000000 + /*== MinLog1p = -1+2^(-53) ==*/ + .align 16 + .quad 0xbfefffffffffffff, 0xbfefffffffffffff + /*== MaxLog1p ==*/ + .align 16 + .quad 0x7f3ffffffffff000, 0x7f3ffffffffff000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== SgnMask ==*/ + .align 16 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff + /*== XThreshold ==*/ + .align 16 + .quad 0x3e00000000000000, 0x3e00000000000000 + /*== XhMask ==*/ + .align 16 + .quad 0xfffffffffffffc00, 0xfffffffffffffc00 + /*== Threshold ==*/ + 
.align 16 + .quad 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 16 + .quad 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 16 + .quad 0x408ff00000000000, 0x408ff00000000000 + /*== ExpMask ==*/ + .align 16 + .quad 0x7ff0000000000000, 0x7ff0000000000000 + /*== ExpMask2 ==*/ + .align 16 + .quad 0x7f40000000000000, 0x7f40000000000000 + /*== L2L ==*/ + .align 16 + .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF + /*== dBigThreshold ==*/ + .align 16 + .quad 0x41D0000000000000, 0x41D0000000000000 + /*== dLargestFinite ==*/ + .align 16 + .quad 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF + /*== dThirtyOne ==*/ + .align 16 + .quad 0x403F000000000000, 0x403F000000000000 + /*== XScale ==*/ + .align 16 + .quad 0x3E10000000000000, 0x3E10000000000000 + .align 16 + .type __svml_dacosh_data_internal,@object + .size __svml_dacosh_data_internal,.-__svml_dacosh_data_internal + .align 16 + +.FLT_20: + .long 0x00000000,0x43380000,0x00000000,0x43380000 + .type .FLT_20,@object + .size .FLT_20,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core-sse.S new file mode 100644 index 0000000000..cc524d4813 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized acosh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN4v_acosh _ZGVdN4v_acosh_sse_wrapper +#include "../svml_d_acosh4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core.c new file mode 100644 index 0000000000..bb07c44f4b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized acosh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN4v_acosh +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_acosh, __GI__ZGVdN4v_acosh, __redirect__ZGVdN4v_acosh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core_avx2.S new file mode 100644 index 0000000000..d824a7562e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh4_core_avx2.S @@ -0,0 +1,1533 @@ +/* Function acosh vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute acosh(x) as log(x + sqrt(x*x - 1)) + * + * Special cases: + * + * acosh(NaN) = quiet NaN, and raise invalid exception + * acosh(-INF) = NaN + * acosh(+INF) = +INF + * acosh(x) = NaN if x < 1 + * acosh(1) = +0 + * + */ + +/* Offsets for data table __svml_dacosh_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8224 +#define poly_coeff 12352 +#define ExpMask 12480 +#define Two10 12512 +#define MinLog1p 12544 +#define MaxLog1p 12576 +#define One 12608 +#define SgnMask 12640 +#define XThreshold 12672 +#define XhMask 12704 +#define Threshold 12736 +#define Bias 12768 +#define Bias1 12800 +#define ExpMask0 12832 +#define ExpMask2 12864 +#define L2 12896 +#define dBigThreshold 12928 +#define dC1 12960 +#define dC2 12992 +#define dC3 13024 +#define dC4 13056 +#define dC5 13088 +#define dLargestFinite 13120 +#define dThirtyOne 13152 +#define dTopMask12 13184 +#define dTopMask29 13216 +#define XScale 13248 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_acosh_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea -4218848+__svml_dacosh_data_internal(%rip), %r8 + +/* Load the constant 1 and possibly other stuff */ + vmovupd 
One+__svml_dacosh_data_internal(%rip), %ymm8 + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + + * 63/256 * e^5 + 231/1024 * e^6 + .... + * So compute the first five nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. + * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + * C4 = 35/128 + * C5 = 63/256 + */ + vmovupd dC5+__svml_dacosh_data_internal(%rip), %ymm3 + vmovapd %ymm0, %ymm9 + vmovapd %ymm8, %ymm13 + vfmsub231pd %ymm9, %ymm9, %ymm13 + +/* + * Check that 1 < X < +inf; otherwise go to the callout function. + * We need the callout for X = 1 to avoid division by zero below. + * This test ensures that callout handles NaN and either infinity. + */ + vcmpnle_uqpd dLargestFinite+__svml_dacosh_data_internal(%rip), %ymm9, %ymm10 + vcmpngt_uqpd %ymm8, %ymm9, %ymm11 + +/* dU is needed later on */ + vsubpd %ymm8, %ymm9, %ymm6 + +/* + * The following computation can go wrong for very large X, e.g. + * the X^2 - 1 = U * V can overflow. But for large X we have + * acosh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + vcmplt_oqpd dBigThreshold+__svml_dacosh_data_internal(%rip), %ymm9, %ymm7 + +/* + * do the same thing but with NR iteration + * Finally, express Y + W = U * V accurately where Y has <= 29 bits + */ + vandpd dTopMask29+__svml_dacosh_data_internal(%rip), %ymm13, %ymm5 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 12 significant bits in case it isn't already + * This means that R * Y and R^2 * Y are exactly representable. 
+ */ + vcvtpd2ps %ymm5, %xmm14 + vsubpd %ymm5, %ymm13, %ymm4 + vrsqrtps %xmm14, %xmm15 + vcvtps2pd %xmm15, %ymm0 + vandpd dTopMask12+__svml_dacosh_data_internal(%rip), %ymm0, %ymm2 + vorpd %ymm11, %ymm10, %ymm12 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + vmulpd %ymm2, %ymm5, %ymm10 + vmulpd %ymm4, %ymm2, %ymm11 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-12 + */ + vmovapd %ymm8, %ymm1 + vfnmadd231pd %ymm10, %ymm2, %ymm1 + +/* + * For low-accuracy versions, the computation can be done + * just as U + ((S + T) + (S + T) * Corr) + */ + vaddpd %ymm11, %ymm10, %ymm13 + vfnmadd231pd %ymm11, %ymm2, %ymm1 + vfmadd213pd dC4+__svml_dacosh_data_internal(%rip), %ymm1, %ymm3 + vfmadd213pd dC3+__svml_dacosh_data_internal(%rip), %ymm1, %ymm3 + vfmadd213pd dC2+__svml_dacosh_data_internal(%rip), %ymm1, %ymm3 + vfmadd213pd dC1+__svml_dacosh_data_internal(%rip), %ymm1, %ymm3 + vmovmskpd %ymm12, %eax + vmulpd %ymm3, %ymm1, %ymm12 + +/* Now multiplex to the case X = 2^-30 * input, Xl = dL = 0 in the "big" case. */ + vmulpd XScale+__svml_dacosh_data_internal(%rip), %ymm9, %ymm3 + vfmadd213pd %ymm13, %ymm12, %ymm13 + vaddpd %ymm13, %ymm6, %ymm6 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. 
+ * compute 1+x as high, low parts + */ + vmaxpd %ymm6, %ymm8, %ymm4 + vminpd %ymm6, %ymm8, %ymm2 + vandpd SgnMask+__svml_dacosh_data_internal(%rip), %ymm6, %ymm14 + vcmplt_oqpd XThreshold+__svml_dacosh_data_internal(%rip), %ymm14, %ymm15 + vaddpd %ymm2, %ymm4, %ymm0 + vorpd XhMask+__svml_dacosh_data_internal(%rip), %ymm15, %ymm5 + vandpd %ymm5, %ymm0, %ymm6 + vblendvpd %ymm7, %ymm6, %ymm3, %ymm5 + vsubpd %ymm6, %ymm4, %ymm1 + +/* 2^ (-10-exp(X) ) */ + vmovupd ExpMask2+__svml_dacosh_data_internal(%rip), %ymm15 + vaddpd %ymm1, %ymm2, %ymm10 + +/* exponent bits */ + vpsrlq $20, %ymm5, %ymm2 + +/* + * Now resume the main code. + * preserve mantissa, set input exponent to 2^(-10) + */ + vandpd ExpMask+__svml_dacosh_data_internal(%rip), %ymm5, %ymm11 + vorpd Two10+__svml_dacosh_data_internal(%rip), %ymm11, %ymm12 + +/* reciprocal approximation good to at least 11 bits */ + vcvtpd2ps %ymm12, %xmm13 + vrcpps %xmm13, %xmm14 + +/* exponent*log(2.0) */ + vmovupd Threshold+__svml_dacosh_data_internal(%rip), %ymm13 + vcvtps2pd %xmm14, %ymm3 + vandpd %ymm7, %ymm10, %ymm4 + +/* exponent of X needed to scale Xl */ + vandps ExpMask0+__svml_dacosh_data_internal(%rip), %ymm5, %ymm0 + vpsubq %ymm0, %ymm15, %ymm6 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + vroundpd $0, %ymm3, %ymm3 + vextractf128 $1, %ymm2, %xmm1 + vshufps $221, %xmm1, %xmm2, %xmm10 + +/* biased exponent in DP format */ + vcvtdq2pd %xmm10, %ymm12 + +/* scale DblRcp */ + vmulpd %ymm6, %ymm3, %ymm2 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + vaddpd dThirtyOne+__svml_dacosh_data_internal(%rip), %ymm12, %ymm11 + +/* argument reduction */ + vfmsub213pd %ymm8, %ymm2, %ymm5 + vmulpd %ymm2, %ymm4, %ymm8 + vmovupd poly_coeff+64+__svml_dacosh_data_internal(%rip), %ymm2 + vblendvpd %ymm7, %ymm12, %ymm11, %ymm1 + +/* + * prepare table index + * table lookup + */ + vpsrlq $40, %ymm3, %ymm7 + vcmplt_oqpd %ymm3, %ymm13, %ymm3 + vandpd 
Bias+__svml_dacosh_data_internal(%rip), %ymm3, %ymm14 + vorpd Bias1+__svml_dacosh_data_internal(%rip), %ymm14, %ymm15 + vsubpd %ymm15, %ymm1, %ymm1 + +/* polynomial */ + vmovupd poly_coeff+__svml_dacosh_data_internal(%rip), %ymm3 + vmovd %xmm7, %edx + vextractf128 $1, %ymm7, %xmm10 + vpextrd $2, %xmm7, %ecx + vmulpd L2+__svml_dacosh_data_internal(%rip), %ymm1, %ymm7 + vaddpd %ymm8, %ymm5, %ymm1 + vmovd %xmm10, %esi + vsubpd %ymm5, %ymm1, %ymm5 + vfmadd213pd poly_coeff+32+__svml_dacosh_data_internal(%rip), %ymm1, %ymm3 + vfmadd213pd poly_coeff+96+__svml_dacosh_data_internal(%rip), %ymm1, %ymm2 + vsubpd %ymm5, %ymm8, %ymm4 + vmulpd %ymm1, %ymm1, %ymm8 + vfmadd213pd %ymm2, %ymm8, %ymm3 + movslq %edx, %rdx + movslq %esi, %rsi + vpextrd $2, %xmm10, %edi + movslq %ecx, %rcx + movslq %edi, %rdi + +/* + * reconstruction + * VQFMA( D, R, P, R2, R ); + */ + vfmadd213pd %ymm4, %ymm8, %ymm3 + vmovsd (%r8,%rdx), %xmm0 + vmovsd (%r8,%rsi), %xmm11 + vmovhpd (%r8,%rcx), %xmm0, %xmm6 + vmovhpd (%r8,%rdi), %xmm11, %xmm12 + vinsertf128 $1, %xmm12, %ymm6, %ymm0 + vaddpd %ymm3, %ymm1, %ymm6 + vaddpd %ymm6, %ymm0, %ymm0 + vaddpd %ymm0, %ymm7, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm9 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm9, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 
8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call acosh@PLT
+ # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_acosh_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dacosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(32)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(32)) VUINT32 poly_coeff[4][4][2]; + __declspec(align(32)) VUINT32 ExpMask[4][2]; + __declspec(align(32)) VUINT32 Two10[4][2]; + __declspec(align(32)) VUINT32 MinLog1p[4][2]; + __declspec(align(32)) VUINT32 MaxLog1p[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 SgnMask[4][2]; + __declspec(align(32)) VUINT32 XThreshold[4][2]; + __declspec(align(32)) VUINT32 XhMask[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 Bias[4][2]; + __declspec(align(32)) VUINT32 Bias1[4][2]; + __declspec(align(32)) VUINT32 ExpMask0[4][2]; + __declspec(align(32)) VUINT32 ExpMask2[4][2]; + __declspec(align(32)) VUINT32 L2[4][2]; + __declspec(align(32)) VUINT32 dBigThreshold[4][2]; + __declspec(align(32)) VUINT32 dC1[4][2]; + __declspec(align(32)) VUINT32 dC2[4][2]; + __declspec(align(32)) VUINT32 dC3[4][2]; + __declspec(align(32)) VUINT32 dC4[4][2]; + __declspec(align(32)) VUINT32 dC5[4][2]; + __declspec(align(32)) VUINT32 dLargestFinite[4][2]; + __declspec(align(32)) VUINT32 dThirtyOne[4][2]; + __declspec(align(32)) VUINT32 dTopMask12[4][2]; + __declspec(align(32)) VUINT32 dTopMask29[4][2]; + __declspec(align(32)) VUINT32 XScale[4][2]; +} __svml_dacosh_data_internal; +#endif +__svml_dacosh_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 
0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 
0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + 
.quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 
0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + 
.quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 
0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + 
.quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 
0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + 
.quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 
0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + 
.quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 
0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 32 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 
0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 
0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 
0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 
0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 
0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 
0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 
0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 0x3f8010157588de71 + .quad 0x3f7c189cbb0e27fb + .quad 0x3f78121214586b54 + .quad 0x3f740c8a747878e2 + .quad 0x3f70080559588b35 + .quad 0x3f680904828985c0 + .quad 0x3f60040155d5889e + .quad 0x3f50020055655889 + .quad 0x0000000000000000 + /*== poly_coeff[4] ==*/ + .align 32 + .quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */ + .quad 0xbfd0000148058EE1, 0xbfd0000148058EE1, 
0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */
+	.quad 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */
+	.quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */
+	/*== ExpMask ==*/
+	.align 32
+	.quad 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff
+	/*== Two10 ==*/
+	.align 32
+	.quad 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000
+	/*== MinLog1p = -1+2^(-53) ==*/
+	.align 32
+	.quad 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff
+	/*== MaxLog1p ==*/
+	.align 32
+	.quad 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000
+	/*== One ==*/
+	.align 32
+	.quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000
+	/*== SgnMask ==*/
+	.align 32
+	.quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff
+	/*== XThreshold ==*/
+	.align 32
+	.quad 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000
+	/*== XhMask ==*/
+	.align 32
+	.quad 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00
+	/*== Threshold ==*/
+	.align 32
+	.quad 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000
+	/*== Bias ==*/
+	.align 32
+	.quad 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000
+	/*== Bias1 ==*/
+	.align 32
+	.quad 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000
+	/*== ExpMask ==*/
+	.align 32
+	.quad 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000
+	/*== ExpMask2 ==*/
+	.align 32
+	.quad 0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000
+	/*== L2L ==*/
+	.align 32
+	.quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF
+	/*== dBigThreshold ==*/
+	.align 32
+	.quad 0x41D0000000000000,
0x41D0000000000000, 0x41D0000000000000, 0x41D0000000000000
+	/*== dC1 ==*/
+	.align 32
+	.quad 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000
+	/*== dC2 ==*/
+	.align 32
+	.quad 0x3fd7fffffffffffa, 0x3fd7fffffffffffa, 0x3fd7fffffffffffa, 0x3fd7fffffffffffa
+	/*== dC3 ==*/
+	.align 32
+	.quad 0x3fd3fffffffffffa, 0x3fd3fffffffffffa, 0x3fd3fffffffffffa, 0x3fd3fffffffffffa
+	/*== dC4 ==*/
+	.align 32
+	.quad 0x3fd1800013d9d428, 0x3fd1800013d9d428, 0x3fd1800013d9d428, 0x3fd1800013d9d428
+	/*== dC5 ==*/
+	.align 32
+	.quad 0x3fcf800025de102f, 0x3fcf800025de102f, 0x3fcf800025de102f, 0x3fcf800025de102f
+	/*== dLargestFinite ==*/
+	.align 32
+	.quad 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF
+	/*== dThirtyOne ==*/
+	.align 32
+	.quad 0x403F000000000000, 0x403F000000000000, 0x403F000000000000, 0x403F000000000000
+	/*== dTopMask12 ==*/
+	.align 32
+	.quad 0xFFFFFE0000000000, 0xFFFFFE0000000000, 0xFFFFFE0000000000, 0xFFFFFE0000000000
+	/*== dTopMask29 ==*/
+	.align 32
+	.quad 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000
+	/*== XScale ==*/
+	.align 32
+	.quad 0x3E10000000000000, 0x3E10000000000000, 0x3E10000000000000, 0x3E10000000000000
+	.align 32
+	.type	__svml_dacosh_data_internal, @object
+	.size	__svml_dacosh_data_internal, .-__svml_dacosh_data_internal
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core-avx2.S
new file mode 100644
index 0000000000..48879787c1
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core-avx2.S
@@ -0,0 +1,20 @@
+/* AVX2 version of vectorized acosh, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define _ZGVeN8v_acosh _ZGVeN8v_acosh_avx2_wrapper
+#include "../svml_d_acosh8_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core.c
new file mode 100644
index 0000000000..4322a5f707
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized acosh, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVeN8v_acosh +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_acosh, __GI__ZGVeN8v_acosh, __redirect__ZGVeN8v_acosh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core_avx512.S new file mode 100644 index 0000000000..3199ef77e2 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_acosh8_core_avx512.S @@ -0,0 +1,480 @@ +/* Function acosh vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute acosh(x) as log(x + sqrt(x*x - 1)) + * using RSQRT instructions for starting the + * square root approximation, and small table lookups for log + * that map to AVX-512 permute instructions + * + * Special cases: + * + * acosh(NaN) = quiet NaN, and raise invalid exception + * acosh(-INF) = NaN + * acosh(+INF) = +INF + * acosh(x) = NaN if x < 1 + * acosh(1) = +0 + * + */ + +/* Offsets for data table __svml_dacosh_data_internal_avx512 + */ +#define Log_tbl_H 0 +#define Log_tbl_L 128 +#define One 256 +#define SmallThreshold 320 +#define Threshold 384 +#define LargeThreshold 448 +#define ca2 512 +#define ca1 576 +#define c4s 640 +#define c3s 704 +#define c2s 768 +#define c1s 832 +#define AddB5 896 +#define RcpBitMask 960 +#define OneEighth 1024 +#define Four 1088 +#define poly_coeff9 1152 +#define poly_coeff8 1216 +#define poly_coeff7 1280 +#define poly_coeff6 1344 +#define poly_coeff5 1408 +#define poly_coeff4 1472 +#define poly_coeff3 1536 +#define poly_coeff2 1600 +#define poly_coeff1 1664 +#define L2H 1728 +#define L2L 1792 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_acosh_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups One+__svml_dacosh_data_internal_avx512(%rip), %zmm5 + +/* polynomial computation for small inputs */ + vmovups ca2+__svml_dacosh_data_internal_avx512(%rip), %zmm13 + vmovups ca1+__svml_dacosh_data_internal_avx512(%rip), %zmm14 + +/* + * sqrt(1+x^2) ~ Sh + Sl + Sh*Eh*poly_s + * poly_s = c1+c2*Eh+c3*Eh^2 + */ + vmovups c4s+__svml_dacosh_data_internal_avx512(%rip), %zmm1 + vmovups c2s+__svml_dacosh_data_internal_avx512(%rip), %zmm2 + vmovups c1s+__svml_dacosh_data_internal_avx512(%rip), %zmm6 + +/* very large inputs ? */ + vmovups Threshold+__svml_dacosh_data_internal_avx512(%rip), %zmm15 + +/* out of range inputs?
*/ + vmovups LargeThreshold+__svml_dacosh_data_internal_avx512(%rip), %zmm3 + +/* not a very small input ? */ + vmovups SmallThreshold+__svml_dacosh_data_internal_avx512(%rip), %zmm10 + vmovaps %zmm0, %zmm12 + +/* x^2 - 1 */ + vmovaps %zmm5, %zmm11 + vfmsub231pd {rn-sae}, %zmm12, %zmm12, %zmm11 + vcmppd $21, {sae}, %zmm15, %zmm12, %k2 + vcmppd $22, {sae}, %zmm3, %zmm12, %k0 + vcmppd $18, {sae}, %zmm5, %zmm12, %k1 + vrsqrt14pd %zmm11, %zmm4 + vcmppd $21, {sae}, %zmm10, %zmm11, %k3 + vfmadd231pd {rn-sae}, %zmm11, %zmm13, %zmm14 + vmovups c3s+__svml_dacosh_data_internal_avx512(%rip), %zmm13 + +/* Sh ~sqrt(-1+x^2) */ + vmulpd {rn-sae}, %zmm4, %zmm11, %zmm9 + vmulpd {rn-sae}, %zmm11, %zmm14, %zmm8 + +/* Sh+x */ + vaddpd {rn-sae}, %zmm12, %zmm9, %zmm15 + +/* Shh */ + vsubpd {rn-sae}, %zmm12, %zmm15, %zmm14 + +/* (Yh*R0)_low */ + vmovaps %zmm11, %zmm0 + korw %k0, %k1, %k0 + +/* rel. error term: Eh=1-Sh*R0 */ + vmovaps %zmm5, %zmm7 + vfmsub213pd {rn-sae}, %zmm9, %zmm4, %zmm0 + vfnmadd231pd {rn-sae}, %zmm9, %zmm4, %zmm7 + +/* rel. 
error term: Eh=(1-Sh*R0)-Sl*R0 */ + vfnmadd231pd {rn-sae}, %zmm0, %zmm4, %zmm7 + +/* Shl */ + vsubpd {rn-sae}, %zmm14, %zmm9, %zmm4 + vmovups poly_coeff7+__svml_dacosh_data_internal_avx512(%rip), %zmm14 + vfmadd231pd {rn-sae}, %zmm7, %zmm1, %zmm13 + vfmadd213pd {rn-sae}, %zmm2, %zmm7, %zmm13 + vfmadd213pd {rn-sae}, %zmm6, %zmm7, %zmm13 + +/* Sh*Eh */ + vmulpd {rn-sae}, %zmm7, %zmm9, %zmm7 + +/* Sl + Sh*Eh*poly_s */ + vfmadd213pd {rn-sae}, %zmm0, %zmm13, %zmm7 + +/* polynomials */ + vmovups poly_coeff9+__svml_dacosh_data_internal_avx512(%rip), %zmm13 + +/* polynomial computation for small inputs */ + vaddpd {rn-sae}, %zmm7, %zmm9, %zmm0 + +/* Xin0+Sl+Sh*Eh*poly_s ~ x+sqrt(1+x^2) */ + vaddpd {rn-sae}, %zmm7, %zmm15, %zmm6 + vfmadd231pd {rn-sae}, %zmm0, %zmm8, %zmm0 + +/* fixup for very large inputs */ + vmovups OneEighth+__svml_dacosh_data_internal_avx512(%rip), %zmm8 + +/* Sl_high */ + vsubpd {rn-sae}, %zmm15, %zmm6, %zmm9 + vmovups poly_coeff6+__svml_dacosh_data_internal_avx512(%rip), %zmm15 + vmulpd {rn-sae}, %zmm8, %zmm12, %zmm6{%k2} + +/* Sl_l */ + vsubpd {rn-sae}, %zmm9, %zmm7, %zmm3 + vrcp14pd %zmm6, %zmm1 + +/* Xin_low */ + vaddpd {rn-sae}, %zmm4, %zmm3, %zmm7 + +/* Table lookups */ + vmovups __svml_dacosh_data_internal_avx512(%rip), %zmm3 + +/* round reciprocal to 1+4b mantissas */ + vpaddq AddB5+__svml_dacosh_data_internal_avx512(%rip), %zmm1, %zmm2 + +/* fixup for very large inputs */ + vxorpd %zmm7, %zmm7, %zmm7{%k2} + vmovups poly_coeff8+__svml_dacosh_data_internal_avx512(%rip), %zmm1 + vandpd RcpBitMask+__svml_dacosh_data_internal_avx512(%rip), %zmm2, %zmm8 + vmovups Log_tbl_L+__svml_dacosh_data_internal_avx512(%rip), %zmm2 + +/* Prepare table index */ + vpsrlq $48, %zmm8, %zmm9 + +/* reduced argument for log(): (Rcp*Xin-1)+Rcp*Xin_low */ + vfmsub231pd {rn-sae}, %zmm8, %zmm6, %zmm5 + +/* exponents */ + vgetexppd {sae}, %zmm8, %zmm4 + vmovups Four+__svml_dacosh_data_internal_avx512(%rip), %zmm6 + vpermt2pd 
Log_tbl_H+64+__svml_dacosh_data_internal_avx512(%rip), %zmm9, %zmm3 + vpermt2pd Log_tbl_L+64+__svml_dacosh_data_internal_avx512(%rip), %zmm9, %zmm2 + vsubpd {rn-sae}, %zmm6, %zmm4, %zmm4{%k2} + vfmadd231pd {rn-sae}, %zmm8, %zmm7, %zmm5 + vmovups poly_coeff5+__svml_dacosh_data_internal_avx512(%rip), %zmm6 + vmovups poly_coeff4+__svml_dacosh_data_internal_avx512(%rip), %zmm7 + +/* -K*L2H + Th */ + vmovups L2H+__svml_dacosh_data_internal_avx512(%rip), %zmm8 + +/* -K*L2L + Tl */ + vmovups L2L+__svml_dacosh_data_internal_avx512(%rip), %zmm9 + vfmadd231pd {rn-sae}, %zmm5, %zmm13, %zmm1 + vmovups poly_coeff2+__svml_dacosh_data_internal_avx512(%rip), %zmm13 + vfnmadd231pd {rn-sae}, %zmm4, %zmm8, %zmm3 + vfnmadd213pd {rn-sae}, %zmm2, %zmm9, %zmm4 + vfmadd213pd {rn-sae}, %zmm14, %zmm5, %zmm1 + vmovups poly_coeff3+__svml_dacosh_data_internal_avx512(%rip), %zmm2 + vmovups poly_coeff1+__svml_dacosh_data_internal_avx512(%rip), %zmm14 + vfmadd213pd {rn-sae}, %zmm15, %zmm5, %zmm1 + +/* R^2 */ + vmulpd {rn-sae}, %zmm5, %zmm5, %zmm15 + vfmadd213pd {rn-sae}, %zmm6, %zmm5, %zmm1 + vfmadd213pd {rn-sae}, %zmm7, %zmm5, %zmm1 + vfmadd213pd {rn-sae}, %zmm2, %zmm5, %zmm1 + vfmadd213pd {rn-sae}, %zmm13, %zmm5, %zmm1 + vfmadd213pd {rn-sae}, %zmm14, %zmm5, %zmm1 + +/* Tl + R^2*Poly */ + vfmadd213pd {rn-sae}, %zmm4, %zmm15, %zmm1 + +/* R+Tl + R^2*Poly */ + vaddpd {rn-sae}, %zmm5, %zmm1, %zmm5 + vaddpd {rn-sae}, %zmm5, %zmm3, %zmm0{%k3} + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 k0 zmm0 zmm12 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm12, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 k0 zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax k0 + + vzeroupper + movq %r12, 16(%rsp) + /* 
DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + kmovd %k0, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 
0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call acosh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_acosh_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dacosh_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl_H[16][2]; + __declspec(align(64)) VUINT32 Log_tbl_L[16][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 SmallThreshold[8][2]; + __declspec(align(64)) VUINT32 Threshold[8][2]; + __declspec(align(64)) VUINT32 LargeThreshold[8][2]; + __declspec(align(64)) VUINT32 ca2[8][2]; + __declspec(align(64)) VUINT32 ca1[8][2]; + __declspec(align(64)) VUINT32 c4s[8][2]; + __declspec(align(64)) VUINT32 c3s[8][2]; + __declspec(align(64)) VUINT32 c2s[8][2]; + __declspec(align(64)) VUINT32 c1s[8][2]; + __declspec(align(64)) VUINT32 AddB5[8][2]; + __declspec(align(64)) VUINT32 RcpBitMask[8][2]; + __declspec(align(64)) VUINT32 OneEighth[8][2]; + __declspec(align(64)) VUINT32 Four[8][2]; + __declspec(align(64)) VUINT32 poly_coeff9[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + __declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + __declspec(align(64)) VUINT32 L2H[8][2]; + __declspec(align(64)) VUINT32 L2L[8][2]; + } __svml_dacosh_data_internal_avx512; +#endif +__svml_dacosh_data_internal_avx512: + /*==
Log_tbl_H ==*/ + .quad 0x0000000000000000 + .quad 0xbfaf0a30c0120000 + .quad 0xbfbe27076e2b0000 + .quad 0xbfc5ff3070a78000 + .quad 0xbfcc8ff7c79a8000 + .quad 0xbfd1675cababc000 + .quad 0xbfd4618bc21c4000 + .quad 0xbfd739d7f6bbc000 + .quad 0xbfd9f323ecbf8000 + .quad 0xbfdc8ff7c79a8000 + .quad 0xbfdf128f5faf0000 + .quad 0xbfe0be72e4252000 + .quad 0xbfe1e85f5e704000 + .quad 0xbfe307d7334f2000 + .quad 0xbfe41d8fe8468000 + .quad 0xbfe52a2d265bc000 + /*== Log_tbl_L ==*/ + .align 64 + .quad 0x0000000000000000 + .quad 0x3d53ab33d066d1d2 + .quad 0x3d2a342c2af0003c + .quad 0xbd43d3c873e20a07 + .quad 0xbd4a21ac25d81ef3 + .quad 0x3d59f1fc63382a8f + .quad 0xbd5ec27d0b7b37b3 + .quad 0xbd50069ce24c53fb + .quad 0xbd584bf2b68d766f + .quad 0xbd5a21ac25d81ef3 + .quad 0xbd3bb2cd720ec44c + .quad 0xbd55056d312f7668 + .quad 0xbd1a07bd8b34be7c + .quad 0x3d5e83c094debc15 + .quad 0x3d5aa33736867a17 + .quad 0xbd46abb9df22bc57 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== SmallThreshold ==*/ + .align 64 + .quad 0x3ef0000000000000, 0x3ef0000000000000, 0x3ef0000000000000, 0x3ef0000000000000, 0x3ef0000000000000, 0x3ef0000000000000, 0x3ef0000000000000, 0x3ef0000000000000 + /*== Threshold ==*/ + .align 64 + .quad 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000 + /*== LargeThreshold ==*/ + .align 64 + .quad 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff + /*== ca2 ==*/ + .align 64 + .quad 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7 + /*== ca1 ==*/ + .align 64 + .quad 0xbfc5555555521e7e, 0xbfc5555555521e7e, 
0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e + /*== c4s ==*/ + .align 64 + .quad 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612 + /*== c3s ==*/ + .align 64 + .quad 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000 + /*== c2s ==*/ + .align 64 + .quad 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000 + /*== c1s ==*/ + .align 64 + .quad 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000 + /*== AddB5 ==*/ + .align 64 + .quad 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000 + /*== RcpBitMask ==*/ + .align 64 + .quad 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000 + /*==OneEighth ==*/ + .align 64 + .quad 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000 + /*== Four ==*/ + .align 64 + .quad 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000 + /*== poly_coeff9 ==*/ + .align 64 + .quad 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0x3fbc80666e249778, 0x3fbc80666e249778, 
0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778 + /*== poly_coeff7 ==*/ + .align 64 + .quad 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9 + /*== poly_coeff6 ==*/ + .align 64 + .quad 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1 + /*== poly_coeff5 ==*/ + .align 64 + .quad 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af + /*== poly_coeff3 ==*/ + .align 64 + .quad 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65 + /*== poly_coeff2 ==*/ + .align 64 + .quad 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1 + /*== poly_coeff1 ==*/ + .align 64 + .quad 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .quad 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000 + /*== L2L = log(2)_low ==*/ + .align 64 + .quad 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000 + .align 64 + .type 
__svml_dacosh_data_internal_avx512,@object + .size __svml_dacosh_data_internal_avx512,.-__svml_dacosh_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core-avx2.S new file mode 100644 index 0000000000..a54c6863c5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized acoshf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN16v_acoshf _ZGVeN16v_acoshf_avx2_wrapper +#include "../svml_s_acoshf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core.c new file mode 100644 index 0000000000..8109b73ebf --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized acoshf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVeN16v_acoshf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_acoshf, __GI__ZGVeN16v_acoshf, + __redirect__ZGVeN16v_acoshf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core_avx512.S new file mode 100644 index 0000000000..688ca38669 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf16_core_avx512.S @@ -0,0 +1,449 @@ +/* Function acoshf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute acosh(x) as log(x + sqrt(x*x - 1)) + * using RSQRT instructions for starting the + * square root approximation, and small table lookups for log + * that map to AVX-512 permute instructions + * + * Special cases: + * + * acosh(NaN) = quiet NaN, and raise invalid exception + * acosh(-INF) = NaN + * acosh(+INF) = +INF + * acosh(x) = NaN if x < 1 + * acosh(1) = +0 + * + */ + +/* Offsets for data table __svml_sacosh_data_internal_avx512 + */ +#define Log_tbl_H 0 +#define Log_tbl_L 128 +#define One 256 +#define SmallThreshold 320 +#define Threshold 384 +#define LargeThreshold 448 +#define ca1 512 +#define c2s 576 +#define c1s 640 +#define AddB5 704 +#define RcpBitMask 768 +#define OneEighth 832 +#define Four 896 +#define poly_coeff3 960 +#define poly_coeff2 1024 +#define poly_coeff1 1088 +#define L2H 1152 +#define L2L 1216 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_acoshf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovups One+__svml_sacosh_data_internal_avx512(%rip), %zmm1 + +/* + * sqrt(1+x^2) ~ Sh + Sl + Sh*Eh*poly_s + * poly_s = c1+c2*Eh + */ + vmovups c2s+__svml_sacosh_data_internal_avx512(%rip), %zmm13 + vmovups c1s+__svml_sacosh_data_internal_avx512(%rip), %zmm15 + +/* polynomial computation for small inputs */ + vmovups ca1+__svml_sacosh_data_internal_avx512(%rip), %zmm9 + +/* very large inputs ? */ + vmovups Threshold+__svml_sacosh_data_internal_avx512(%rip), %zmm10 + +/* out of range inputs? */ + vmovups LargeThreshold+__svml_sacosh_data_internal_avx512(%rip), %zmm11 + +/* not a very small input ?
*/ + vmovups SmallThreshold+__svml_sacosh_data_internal_avx512(%rip), %zmm6 + vmovaps %zmm0, %zmm8 + +/* x^2 - 1 */ + vmovaps %zmm1, %zmm7 + vfmsub231ps {rn-sae}, %zmm8, %zmm8, %zmm7 + vcmpps $21, {sae}, %zmm10, %zmm8, %k2 + vcmpps $22, {sae}, %zmm11, %zmm8, %k0 + vcmpps $18, {sae}, %zmm1, %zmm8, %k1 + vrsqrt14ps %zmm7, %zmm12 + vcmpps $21, {sae}, %zmm6, %zmm7, %k3 + vmulps {rn-sae}, %zmm9, %zmm7, %zmm4 + +/* Sh ~sqrt(-1+x^2) */ + vmulps {rn-sae}, %zmm12, %zmm7, %zmm5 + +/* Sh+x */ + vaddps {rn-sae}, %zmm8, %zmm5, %zmm9 + +/* (Yh*R0)_low */ + vmovaps %zmm7, %zmm0 + korw %k0, %k1, %k0 + +/* rel. error term: Eh=1-Sh*R0 */ + vmovaps %zmm1, %zmm14 + vfmsub213ps {rn-sae}, %zmm5, %zmm12, %zmm0 + vfnmadd231ps {rn-sae}, %zmm5, %zmm12, %zmm14 + +/* rel. error term: Eh=(1-Sh*R0)-Sl*R0 */ + vfnmadd231ps {rn-sae}, %zmm0, %zmm12, %zmm14 + +/* Sh*Eh */ + vmulps {rn-sae}, %zmm14, %zmm5, %zmm3 + vfmadd231ps {rn-sae}, %zmm14, %zmm13, %zmm15 + +/* Sl + Sh*Eh*poly_s */ + vfmadd213ps {rn-sae}, %zmm0, %zmm15, %zmm3 + +/* Shh */ + vsubps {rn-sae}, %zmm8, %zmm9, %zmm15 + +/* polynomial computation for small inputs */ + vaddps {rn-sae}, %zmm3, %zmm5, %zmm0 + +/* Xin0+Sl+Sh*Eh*poly_s ~ x+sqrt(1+x^2) */ + vaddps {rn-sae}, %zmm3, %zmm9, %zmm2 + +/* Shl */ + vsubps {rn-sae}, %zmm15, %zmm5, %zmm10 + vfmadd231ps {rn-sae}, %zmm0, %zmm4, %zmm0 + +/* fixup for very large inputs */ + vmovups OneEighth+__svml_sacosh_data_internal_avx512(%rip), %zmm4 + +/* Sl_high */ + vsubps {rn-sae}, %zmm9, %zmm2, %zmm5 + +/* polynomial */ + vmovups poly_coeff3+__svml_sacosh_data_internal_avx512(%rip), %zmm9 + vmulps {rn-sae}, %zmm4, %zmm8, %zmm2{%k2} + +/* -K*L2L + Tl */ + vmovups L2L+__svml_sacosh_data_internal_avx512(%rip), %zmm4 + +/* Sl_l */ + vsubps {rn-sae}, %zmm5, %zmm3, %zmm3 + vrcp14ps %zmm2, %zmm11 + vmovups Log_tbl_L+__svml_sacosh_data_internal_avx512(%rip), %zmm5 + +/* Xin_low */ + vaddps {rn-sae}, %zmm10, %zmm3, %zmm13 + +/* round reciprocal to 1+4b mantissas */ + vpaddd 
AddB5+__svml_sacosh_data_internal_avx512(%rip), %zmm11, %zmm12 + vmovups poly_coeff1+__svml_sacosh_data_internal_avx512(%rip), %zmm10 + vandps RcpBitMask+__svml_sacosh_data_internal_avx512(%rip), %zmm12, %zmm14 + +/* fixup for very large inputs */ + vxorps %zmm13, %zmm13, %zmm13{%k2} + +/* reduced argument for log(): (Rcp*Xin-1)+Rcp*Xin_low */ + vfmsub231ps {rn-sae}, %zmm14, %zmm2, %zmm1 + +/* exponents */ + vgetexpps {sae}, %zmm14, %zmm12 + vmovups Four+__svml_sacosh_data_internal_avx512(%rip), %zmm2 + +/* Prepare table index */ + vpsrld $18, %zmm14, %zmm3 + vfmadd231ps {rn-sae}, %zmm14, %zmm13, %zmm1 + vmovups poly_coeff2+__svml_sacosh_data_internal_avx512(%rip), %zmm13 + +/* Table lookups */ + vmovups __svml_sacosh_data_internal_avx512(%rip), %zmm14 + vsubps {rn-sae}, %zmm2, %zmm12, %zmm12{%k2} + vpermt2ps Log_tbl_L+64+__svml_sacosh_data_internal_avx512(%rip), %zmm3, %zmm5 + vpermt2ps Log_tbl_H+64+__svml_sacosh_data_internal_avx512(%rip), %zmm3, %zmm14 + +/* R^2 */ + vmulps {rn-sae}, %zmm1, %zmm1, %zmm11 + +/* -K*L2H + Th */ + vmovups L2H+__svml_sacosh_data_internal_avx512(%rip), %zmm2 + vfmadd231ps {rn-sae}, %zmm1, %zmm9, %zmm13 + vfnmadd231ps {rn-sae}, %zmm12, %zmm2, %zmm14 + vfnmadd213ps {rn-sae}, %zmm5, %zmm4, %zmm12 + vfmadd213ps {rn-sae}, %zmm10, %zmm1, %zmm13 + +/* Tl + R^2*Poly */ + vfmadd213ps {rn-sae}, %zmm12, %zmm11, %zmm13 + +/* R+Tl + R^2*Poly */ + vaddps {rn-sae}, %zmm1, %zmm13, %zmm1 + vaddps {rn-sae}, %zmm1, %zmm14, %zmm0{%k3} + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 k0 zmm0 zmm8 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm8, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 k0 zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax k0 + + 
vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + kmovd %k0, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; 
DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call acoshf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_acoshf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_sacosh_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl_H[32][1]; + __declspec(align(64)) VUINT32 Log_tbl_L[32][1]; + __declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 SmallThreshold[16][1]; + __declspec(align(64)) VUINT32 Threshold[16][1]; + __declspec(align(64)) VUINT32 LargeThreshold[16][1]; + __declspec(align(64)) VUINT32 ca1[16][1]; + __declspec(align(64)) VUINT32 c2s[16][1]; + __declspec(align(64)) VUINT32 c1s[16][1]; + __declspec(align(64)) VUINT32 AddB5[16][1]; + __declspec(align(64)) VUINT32 RcpBitMask[16][1]; + __declspec(align(64)) VUINT32 OneEighth[16][1]; + __declspec(align(64)) VUINT32 Four[16][1]; + __declspec(align(64)) VUINT32 poly_coeff3[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 poly_coeff1[16][1]; + __declspec(align(64)) VUINT32 L2H[16][1]; + __declspec(align(64)) VUINT32 L2L[16][1]; + } __svml_sacosh_data_internal_avx512; +#endif +__svml_sacosh_data_internal_avx512: + /*== Log_tbl_H ==*/ + .long 0x00000000 + .long 0xbcfc0000 + .long 0xbd788000 + .long 0xbdb78000 + .long 0xbdf14000 + .long 0xbe14a000 + .long 0xbe300000 + .long 0xbe4aa000 + .long 0xbe648000 + .long 0xbe7dc000 + .long 0xbe8b4000 + .long 0xbe974000 + .long 0xbea31000 + .long 0xbeae9000 + .long 0xbeb9d000 + .long 0xbec4d000 + .long 0xbecfa000 + .long 0xbeda2000 + .long 0xbee48000 +
.long 0xbeeea000 + .long 0xbef89000 + .long 0xbf012800 + .long 0xbf05f000 + .long 0xbf0aa800 + .long 0xbf0f4000 + .long 0xbf13c800 + .long 0xbf184000 + .long 0xbf1ca000 + .long 0xbf20f000 + .long 0xbf252800 + .long 0xbf295000 + .long 0xbf2d6800 + /*== Log_tbl_L ==*/ + .align 64 + .long 0x80000000 + .long 0xb726c39e + .long 0x3839e7fe + .long 0xb7528ae5 + .long 0x377891d5 + .long 0xb8297c10 + .long 0x37cf8f58 + .long 0x3852b186 + .long 0x35838656 + .long 0xb80c36af + .long 0x38235454 + .long 0xb862bae1 + .long 0x37e87bc7 + .long 0x37848150 + .long 0x37202511 + .long 0xb74e1b05 + .long 0x385c1340 + .long 0xb8777bcd + .long 0x36038656 + .long 0xb7d40984 + .long 0xb80f5faf + .long 0xb8254b4c + .long 0xb865c84a + .long 0x37f0b42d + .long 0xb83ebce1 + .long 0xb83c2513 + .long 0x37a332c4 + .long 0x3779654f + .long 0x38602f73 + .long 0x367449f8 + .long 0xb7b4996f + .long 0xb800986b + /*== One ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== SmallThreshold ==*/ + .align 64 + .long 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000, 0x39800000 + /*== Threshold ==*/ + .align 64 + .long 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000 + /*== LargeThreshold ==*/ + .align 64 + .long 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff + /*== ca1 ==*/ + .align 64 + .long 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 
0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE + /*== c2s ==*/ + .align 64 + .long 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000 + /*== c1s ==*/ + .align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== AddB5 ==*/ + .align 64 + .long 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000 + /*== RcpBitMask ==*/ + .align 64 + .long 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000 + /*==OneEighth ==*/ + .align 64 + .long 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000 + /*== Four ==*/ + .align 64 + .long 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000 + /*== poly_coeff3 ==*/ + .align 64 + .long 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810 + /*== poly_coeff2 ==*/ + .align 64 + .long 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e + /*== poly_coeff1 ==*/ + .align 64 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 
0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .long 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000 + /*== L2L = log(2)_low ==*/ + .align 64 + .long 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4 + .align 64 + .type __svml_sacosh_data_internal_avx512,@object + .size __svml_sacosh_data_internal_avx512,.-__svml_sacosh_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core-sse2.S new file mode 100644 index 0000000000..d789ec1d47 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized acoshf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define _ZGVbN4v_acoshf _ZGVbN4v_acoshf_sse2 +#include "../svml_s_acoshf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core.c new file mode 100644 index 0000000000..b2d9101c47 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized acoshf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_acoshf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_acoshf, __GI__ZGVbN4v_acoshf, + __redirect__ZGVbN4v_acoshf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core_sse4.S new file mode 100644 index 0000000000..e897ea304f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf4_core_sse4.S @@ -0,0 +1,389 @@ +/* Function acoshf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute acosh(x) as log(x + sqrt(x*x - 1)) + * + * Special cases: + * + * acosh(NaN) = quiet NaN, and raise invalid exception + * acosh(-INF) = NaN + * acosh(+INF) = +INF + * acosh(x) = NaN if x < 1 + * acosh(1) = +0 + * + */ + +/* Offsets for data table __svml_sacosh_data_internal + */ +#define sOne 0 +#define sPoly 16 +#define iBrkValue 144 +#define iOffExpoMask 160 +#define sBigThreshold 176 +#define sC2 192 +#define sC3 208 +#define sHalf 224 +#define sLargestFinite 240 +#define sThirtyOne 256 +#define sTopMask8 272 +#define XScale 288 +#define sLn2 304 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_acoshf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + +/* Compute U = X - 1 and V = X + 1, naively first. */ + movaps %xmm0, %xmm12 + +/* Load constants, always including One = 1 */ + movups sOne+__svml_sacosh_data_internal(%rip), %xmm2 + +/* + * Check that 1 < X < +inf; otherwise go to the callout function. + * We need the callout for X = 1 to avoid division by zero below. + * This test ensures that callout handles NaN and either infinity. 
+ */ + movaps %xmm0, %xmm4 + movaps %xmm2, %xmm9 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-8 + */ + movaps %xmm2, %xmm10 + +/* Finally, express Y + W = U * V accurately where Y has <= 8 bits */ + movups sTopMask8+__svml_sacosh_data_internal(%rip), %xmm5 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. + * compute 1+x as high, low parts + */ + movaps %xmm2, %xmm13 + movaps %xmm5, %xmm11 + movaps %xmm2, %xmm3 + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + ... + * So compute the first three nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. + * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + */ + movups sC3+__svml_sacosh_data_internal(%rip), %xmm8 + +/* + * The following computation can go wrong for very large X, e.g. + * the X^2 - 1 = U * V can overflow. But for large X we have + * acosh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + movaps %xmm0, %xmm1 + cmpnleps sLargestFinite+__svml_sacosh_data_internal(%rip), %xmm4 + cmpltps sBigThreshold+__svml_sacosh_data_internal(%rip), %xmm1 + cmpnltps %xmm0, %xmm3 + subps %xmm2, %xmm12 + addps %xmm0, %xmm9 + +/* For low-accuracy versions, naivety is harmless */ + mulps %xmm12, %xmm9 + orps %xmm3, %xmm4 + movmskps %xmm4, %edx + andps %xmm9, %xmm11 + movaps %xmm1, %xmm3 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 8 significant bits. + * This means that R * Y and R^2 * Y are exactly representable. 
+ */ + rsqrtps %xmm11, %xmm7 + subps %xmm11, %xmm9 + andps %xmm5, %xmm7 + movaps %xmm2, %xmm4 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + mulps %xmm7, %xmm11 + movaps %xmm7, %xmm6 + mulps %xmm7, %xmm9 + mulps %xmm11, %xmm6 + mulps %xmm9, %xmm7 + +/* + * For low-accuracy versions, the computation can be done + * just as U + ((S + T) + (S + T) * Corr) + */ + addps %xmm9, %xmm11 + subps %xmm6, %xmm10 + movaps %xmm2, %xmm9 + subps %xmm7, %xmm10 + mulps %xmm10, %xmm8 + +/* Now multiplex to the case X = 2^-30 * input, Xl = 0 in the "big" case. */ + movups XScale+__svml_sacosh_data_internal(%rip), %xmm14 + mulps %xmm0, %xmm14 + addps sC2+__svml_sacosh_data_internal(%rip), %xmm8 + mulps %xmm10, %xmm8 + andnps %xmm14, %xmm3 + +/* + * Now resume the main code. + * reduction: compute r,n + */ + movdqu iBrkValue+__svml_sacosh_data_internal(%rip), %xmm14 + movdqu iOffExpoMask+__svml_sacosh_data_internal(%rip), %xmm5 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + movups sThirtyOne+__svml_sacosh_data_internal(%rip), %xmm6 + addps sHalf+__svml_sacosh_data_internal(%rip), %xmm8 + mulps %xmm8, %xmm10 + movaps %xmm1, %xmm8 + mulps %xmm11, %xmm10 + addps %xmm10, %xmm11 + addps %xmm11, %xmm12 + maxps %xmm12, %xmm13 + minps %xmm12, %xmm9 + movaps %xmm13, %xmm15 + addps %xmm9, %xmm15 + subps %xmm15, %xmm13 + andps %xmm1, %xmm15 + orps %xmm15, %xmm3 + addps %xmm13, %xmm9 + psubd %xmm14, %xmm3 + andps %xmm1, %xmm9 + pand %xmm3, %xmm5 + psrad $23, %xmm3 + cvtdq2ps %xmm3, %xmm7 + pslld $23, %xmm3 + paddd %xmm14, %xmm5 + psubd %xmm3, %xmm4 + +/* polynomial evaluation */ + subps %xmm2, %xmm5 + mulps %xmm4, %xmm9 + addps %xmm7, %xmm6 + movups sPoly+112+__svml_sacosh_data_internal(%rip), %xmm2 + andnps %xmm6, %xmm8 + andps %xmm1, %xmm7 + addps %xmm5, %xmm9 + mulps %xmm9, %xmm2 + orps %xmm7, %xmm8 + +/* final reconstruction */ + 
mulps sLn2+__svml_sacosh_data_internal(%rip), %xmm8 + addps sPoly+96+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + addps sPoly+80+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + addps sPoly+64+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + addps sPoly+48+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + addps sPoly+32+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + addps sPoly+16+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + addps sPoly+__svml_sacosh_data_internal(%rip), %xmm2 + mulps %xmm9, %xmm2 + mulps %xmm9, %xmm2 + addps %xmm2, %xmm9 + addps %xmm8, %xmm9 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm9 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movaps %xmm9, %xmm0 + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm0, 32(%rsp) + movups %xmm9, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm9 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm9 + +/* 
Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call acoshf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_acoshf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_sacosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 sOne[4][1]; + __declspec(align(16)) VUINT32 sPoly[8][4][1]; + __declspec(align(16)) VUINT32 iBrkValue[4][1]; + __declspec(align(16)) VUINT32 iOffExpoMask[4][1]; + __declspec(align(16)) VUINT32 sBigThreshold[4][1]; + __declspec(align(16)) VUINT32 sC2[4][1]; + __declspec(align(16)) VUINT32 sC3[4][1]; + __declspec(align(16)) VUINT32 sHalf[4][1]; + __declspec(align(16)) VUINT32 sLargestFinite[4][1]; + __declspec(align(16)) VUINT32 sThirtyOne[4][1]; + __declspec(align(16)) VUINT32 sTopMask8[4][1]; + __declspec(align(16)) VUINT32 XScale[4][1]; + __declspec(align(16)) VUINT32 sLn2[4][1]; +} __svml_sacosh_data_internal; +#endif +__svml_sacosh_data_internal: + /*== sOne = SP 1.0 ==*/ + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 16 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed,
0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iBrkValue = SP 2/3 ==*/ + .align 16 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sBigThreshold ==*/ + .align 16 + .long 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000 + /*== sC2 ==*/ + .align 16 + .long 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000 + /*== sC3 ==*/ + .align 16 + .long 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000 + /*== sHalf ==*/ + .align 16 + .long 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000 + /*== sLargestFinite ==*/ + .align 16 + .long 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF + /*== sThirtyOne ==*/ + .align 16 + .long 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000 + /*== sTopMask8 ==*/ + .align 16 + .long 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000 + /*== XScale ==*/ + .align 16 + .long 0x30800000, 0x30800000, 0x30800000, 0x30800000 + /*== sLn2 = SP ln(2) ==*/ + .align 16 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 16 + .type __svml_sacosh_data_internal,@object + .size __svml_sacosh_data_internal,.-__svml_sacosh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core-sse.S new file mode 100644 index 0000000000..cb97d291c5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized acoshf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_acoshf _ZGVdN8v_acoshf_sse_wrapper +#include "../svml_s_acoshf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core.c new file mode 100644 index 0000000000..db71194cd0 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized acoshf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN8v_acoshf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_acoshf, __GI__ZGVdN8v_acoshf, + __redirect__ZGVdN8v_acoshf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core_avx2.S new file mode 100644 index 0000000000..1d847fcd40 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_acoshf8_core_avx2.S @@ -0,0 +1,370 @@ +/* Function acoshf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute acosh(x) as log(x + sqrt(x*x - 1)) + * + * Special cases: + * + * acosh(NaN) = quiet NaN, and raise invalid exception + * acosh(-INF) = NaN + * acosh(+INF) = +INF + * acosh(x) = NaN if x < 1 + * acosh(1) = +0 + * + */ + +/* Offsets for data table __svml_sacosh_data_internal + */ +#define sOne 0 +#define sPoly 32 +#define iBrkValue 288 +#define iOffExpoMask 320 +#define sBigThreshold 352 +#define sC2 384 +#define sC3 416 +#define sHalf 448 +#define sLargestFinite 480 +#define sThirtyOne 512 +#define sTopMask8 544 +#define XScale 576 +#define sLn2 608 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_acoshf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + +/* Load constants, always including One = 1 */ + vmovups sOne+__svml_sacosh_data_internal(%rip), %ymm2 + +/* Finally, express Y + W = U * V accurately where Y has <= 8 bits */ + vmovups sTopMask8+__svml_sacosh_data_internal(%rip), %ymm9 + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + ... + * So compute the first three nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. + * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + */ + vmovups sC3+__svml_sacosh_data_internal(%rip), %ymm14 + vmovaps %ymm0, %ymm3 + vmovaps %ymm2, %ymm7 + vfmsub231ps %ymm3, %ymm3, %ymm7 + +/* + * Check that 1 < X < +inf; otherwise go to the callout function. + * We need the callout for X = 1 to avoid division by zero below. + * This test ensures that callout handles NaN and either infinity. + */ + vcmpnle_uqps sLargestFinite+__svml_sacosh_data_internal(%rip), %ymm3, %ymm4 + vcmpngt_uqps %ymm2, %ymm3, %ymm5 + +/* + * The following computation can go wrong for very large X, e.g. + * the X^2 - 1 = U * V can overflow. 
But for large X we have + * acosh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + vcmplt_oqps sBigThreshold+__svml_sacosh_data_internal(%rip), %ymm3, %ymm1 + vandps %ymm9, %ymm7, %ymm10 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 8 significant bits. + * This means that R * Y and R^2 * Y are exactly representable. + */ + vrsqrtps %ymm10, %ymm8 + vsubps %ymm10, %ymm7, %ymm11 + vandps %ymm9, %ymm8, %ymm12 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + vmulps %ymm12, %ymm10, %ymm15 + vmulps %ymm11, %ymm12, %ymm0 + +/* Now multiplex to the case X = 2^-30 * input, Xl = 0 in the "big" case. */ + vmulps XScale+__svml_sacosh_data_internal(%rip), %ymm3, %ymm11 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-8 + */ + vmovaps %ymm2, %ymm13 + vfnmadd231ps %ymm15, %ymm12, %ymm13 + vfnmadd231ps %ymm0, %ymm12, %ymm13 + vfmadd213ps sC2+__svml_sacosh_data_internal(%rip), %ymm13, %ymm14 + vfmadd213ps sHalf+__svml_sacosh_data_internal(%rip), %ymm13, %ymm14 + vmulps %ymm14, %ymm13, %ymm7 + vorps %ymm5, %ymm4, %ymm6 + +/* + * For low-accuracy versions, the computation can be done + * just as U + ((S + T) + (S + T) * Corr) + */ + vaddps %ymm0, %ymm15, %ymm5 + +/* sU is needed later on */ + vsubps %ymm2, %ymm3, %ymm4 + vfmadd213ps %ymm5, %ymm7, %ymm5 + vmovmskps %ymm6, %edx + vaddps %ymm5, %ymm4, %ymm6 + +/* + * Now resume the main code. 
+ * reduction: compute r,n + */ + vmovups iBrkValue+__svml_sacosh_data_internal(%rip), %ymm4 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. + * compute 1+x as high, low parts + */ + vmaxps %ymm6, %ymm2, %ymm8 + vminps %ymm6, %ymm2, %ymm9 + vaddps %ymm9, %ymm8, %ymm12 + vblendvps %ymm1, %ymm12, %ymm11, %ymm14 + vsubps %ymm12, %ymm8, %ymm10 + vpsubd %ymm4, %ymm14, %ymm15 + vaddps %ymm10, %ymm9, %ymm13 + vpand iOffExpoMask+__svml_sacosh_data_internal(%rip), %ymm15, %ymm14 + vpsrad $23, %ymm15, %ymm15 + vpaddd %ymm4, %ymm14, %ymm8 + vpslld $23, %ymm15, %ymm5 + vmovups sPoly+224+__svml_sacosh_data_internal(%rip), %ymm4 + vcvtdq2ps %ymm15, %ymm0 + vpsubd %ymm5, %ymm2, %ymm7 + +/* polynomial evaluation */ + vsubps %ymm2, %ymm8, %ymm2 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + vaddps sThirtyOne+__svml_sacosh_data_internal(%rip), %ymm0, %ymm5 + vandps %ymm1, %ymm13, %ymm6 + vmulps %ymm7, %ymm6, %ymm9 + vblendvps %ymm1, %ymm0, %ymm5, %ymm0 + vaddps %ymm2, %ymm9, %ymm2 + vfmadd213ps sPoly+192+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vfmadd213ps sPoly+160+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vfmadd213ps sPoly+128+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vfmadd213ps sPoly+96+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vfmadd213ps sPoly+64+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vfmadd213ps sPoly+32+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vfmadd213ps sPoly+__svml_sacosh_data_internal(%rip), %ymm2, %ymm4 + vmulps %ymm4, %ymm2, %ymm6 + vfmadd213ps %ymm2, %ymm2, %ymm6 + +/* final reconstruction */ + vfmadd132ps sLn2+__svml_sacosh_data_internal(%rip), %ymm6, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + 
cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm3, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; 
DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call acoshf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_acoshf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_sacosh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 sOne[8][1]; + __declspec(align(32)) VUINT32 sPoly[8][8][1]; + __declspec(align(32)) VUINT32 iBrkValue[8][1]; + __declspec(align(32)) VUINT32 iOffExpoMask[8][1]; + __declspec(align(32)) VUINT32 sBigThreshold[8][1]; + __declspec(align(32)) VUINT32 sC2[8][1]; + __declspec(align(32)) VUINT32 sC3[8][1]; + __declspec(align(32)) VUINT32 sHalf[8][1]; + __declspec(align(32)) VUINT32 sLargestFinite[8][1]; + __declspec(align(32)) VUINT32 sThirtyOne[8][1]; + __declspec(align(32)) VUINT32 sTopMask8[8][1]; + __declspec(align(32)) VUINT32 XScale[8][1]; + __declspec(align(32)) VUINT32 sLn2[8][1]; +} __svml_sacosh_data_internal; +#endif +__svml_sacosh_data_internal: + /*== sOne = SP 1.0 ==*/ + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 32 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94,
0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iBrkValue = SP 2/3 ==*/ + .align 32 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 32 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sBigThreshold ==*/ + .align 32 + .long 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000 + /*== sC2 ==*/ + .align 32 + .long 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000 + /*== sC3 ==*/ + .align 32 + .long 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000 + /*== sHalf ==*/ + .align 32 + .long 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000 + /*== sLargestFinite ==*/ + .align 32 + .long 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF + /*== sThirtyOne ==*/ + .align 32 + .long 0x41F80000, 0x41F80000, 0x41F80000, 
0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000 + /*== sTopMask8 ==*/ + .align 32 + .long 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000 + /*== XScale ==*/ + .align 32 + .long 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000 + /*== sLn2 = SP ln(2) ==*/ + .align 32 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 32 + .type __svml_sacosh_data_internal,@object + .size __svml_sacosh_data_internal,.-__svml_sacosh_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_acosh2_core.S b/sysdeps/x86_64/fpu/svml_d_acosh2_core.S new file mode 100644 index 0000000000..42bd5c1b5d --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_acosh2_core.S @@ -0,0 +1,29 @@ +/* Function acosh vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_acosh) +WRAPPER_IMPL_SSE2 acosh +END (_ZGVbN2v_acosh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_acosh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_acosh4_core.S b/sysdeps/x86_64/fpu/svml_d_acosh4_core.S new file mode 100644 index 0000000000..433192bae1 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_acosh4_core.S @@ -0,0 +1,29 @@ +/* Function acosh vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_acosh) +WRAPPER_IMPL_AVX _ZGVbN2v_acosh +END (_ZGVdN4v_acosh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_acosh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_acosh4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_acosh4_core_avx.S new file mode 100644 index 0000000000..9e60289c45 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_acosh4_core_avx.S @@ -0,0 +1,25 @@ +/* Function acosh vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_acosh) +WRAPPER_IMPL_AVX _ZGVbN2v_acosh +END (_ZGVcN4v_acosh) diff --git a/sysdeps/x86_64/fpu/svml_d_acosh8_core.S b/sysdeps/x86_64/fpu/svml_d_acosh8_core.S new file mode 100644 index 0000000000..ef1f8b3426 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_acosh8_core.S @@ -0,0 +1,25 @@ +/* Function acosh vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_acosh) +WRAPPER_IMPL_AVX512 _ZGVdN4v_acosh +END (_ZGVeN8v_acosh) diff --git a/sysdeps/x86_64/fpu/svml_s_acoshf16_core.S b/sysdeps/x86_64/fpu/svml_s_acoshf16_core.S new file mode 100644 index 0000000000..41c0241492 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_acoshf16_core.S @@ -0,0 +1,25 @@ +/* Function acoshf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_acoshf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_acoshf +END (_ZGVeN16v_acoshf) diff --git a/sysdeps/x86_64/fpu/svml_s_acoshf4_core.S b/sysdeps/x86_64/fpu/svml_s_acoshf4_core.S new file mode 100644 index 0000000000..2ef7f428c0 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_acoshf4_core.S @@ -0,0 +1,29 @@ +/* Function acoshf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_acoshf) +WRAPPER_IMPL_SSE2 acoshf +END (_ZGVbN4v_acoshf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_acoshf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_acoshf8_core.S b/sysdeps/x86_64/fpu/svml_s_acoshf8_core.S new file mode 100644 index 0000000000..40f1066ce2 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_acoshf8_core.S @@ -0,0 +1,29 @@ +/* Function acoshf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_acoshf) +WRAPPER_IMPL_AVX _ZGVbN4v_acoshf +END (_ZGVdN8v_acoshf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_acoshf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_acoshf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_acoshf8_core_avx.S new file mode 100644 index 0000000000..b44a9ed28b --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_acoshf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function acoshf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_acoshf) +WRAPPER_IMPL_AVX _ZGVbN4v_acoshf +END (_ZGVcN8v_acoshf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx.c new file mode 100644 index 0000000000..331c6d71cc --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-acosh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx2.c new file mode 100644 index 0000000000..331c6d71cc --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-acosh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx512f.c new file mode 100644 index 0000000000..331c6d71cc --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-acosh-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-acosh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-acosh.c b/sysdeps/x86_64/fpu/test-double-libmvec-acosh.c new file mode 100644 index 0000000000..19b5997414 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-acosh.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC acosh +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 04a4fe654b..db7ae3e7a6 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVbN2v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVbN2v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVbN2v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVbN2v_atanh) +VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVbN2v_acosh) #define VEC_INT_TYPE __m128i diff --git 
a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index f9ac2fad5d..269ae38f67 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVdN4v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVdN4v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVdN4v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVdN4v_atanh) +VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVdN4v_acosh) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 185801fa82..d95b960a45 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVcN4v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVcN4v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVcN4v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVcN4v_atanh) +VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVcN4v_acosh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 1cc8aaecbf..a22f08b5f8 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10), _ZGVeN8v_log10) VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVeN8v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVeN8v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVeN8v_atanh) +VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVeN8v_acosh) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx.c new file mode 100644 index 0000000000..7d75108bc0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx.c 
@@ -0,0 +1 @@ +#include "test-float-libmvec-acoshf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx2.c new file mode 100644 index 0000000000..7d75108bc0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-acoshf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx512f.c new file mode 100644 index 0000000000..7d75108bc0 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-acoshf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-acoshf.c b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf.c new file mode 100644 index 0000000000..f8b536df2e --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-acoshf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC acoshf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index b5d76d80e0..7982ae2c84 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVeN16v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVeN16v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVeN16v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVeN16v_atanhf) +VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVeN16v_acoshf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index c1df6a03c1..bdfcbea2cd 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVbN4v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVbN4v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), 
_ZGVbN4v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVbN4v_atanhf) +VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVbN4v_acoshf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index f4c646683f..7b3ba81441 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVdN8v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVdN8v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVdN8v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVdN8v_atanhf) +VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVdN8v_acoshf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index a6acd3ffca..a13d2e4ca1 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -42,6 +42,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log10f), _ZGVcN8v_log10f) VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVcN8v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVcN8v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVcN8v_atanhf) +VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVcN8v_acoshf) #define VEC_INT_TYPE __m128i

From patchwork Tue Dec 28 20:11:28 2021
X-Patchwork-Submitter: Sunil Pandey
X-Patchwork-Id: 1573816
To: libc-alpha@sourceware.org
Subject: [PATCH v4 16/18] x86-64: Add vector erf/erff implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:28 -0800
Message-Id: <20211228201130.737370-17-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>
From: Sunil Pandey
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com

Implement vectorized erf/erff containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector erf/erff with regenerated ulps. 
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 20 + .../fpu/multiarch/svml_d_erf2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_erf2_core.c | 27 + .../fpu/multiarch/svml_d_erf2_core_sse4.S | 987 ++++++++++++++++++ .../fpu/multiarch/svml_d_erf4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_erf4_core.c | 27 + .../fpu/multiarch/svml_d_erf4_core_avx2.S | 984 +++++++++++++++++ .../fpu/multiarch/svml_d_erf8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_erf8_core.c | 27 + .../fpu/multiarch/svml_d_erf8_core_avx512.S | 983 +++++++++++++++++ .../fpu/multiarch/svml_s_erff16_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_s_erff16_core.c | 28 + .../fpu/multiarch/svml_s_erff16_core_avx512.S | 185 ++++ .../fpu/multiarch/svml_s_erff4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_erff4_core.c | 28 + .../fpu/multiarch/svml_s_erff4_core_sse4.S | 661 ++++++++++++ .../fpu/multiarch/svml_s_erff8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_erff8_core.c | 28 + .../fpu/multiarch/svml_s_erff8_core_avx2.S | 666 ++++++++++++ sysdeps/x86_64/fpu/svml_d_erf2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_erf4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_erf4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_erf8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_erff16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_erff4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_erff8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_erff8_core_avx.S | 25 + .../x86_64/fpu/test-double-libmvec-erf-avx.c | 1 + .../x86_64/fpu/test-double-libmvec-erf-avx2.c | 1 + .../fpu/test-double-libmvec-erf-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-erf.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-erff-avx.c | 1 + .../x86_64/fpu/test-float-libmvec-erff-avx2.c | 1 + .../fpu/test-float-libmvec-erff-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-erff.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 5038 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_erf2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_erf4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_erf4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_erf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_erff16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_erff4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_erff8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_erff8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-erf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-erf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-erf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-erf.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-erff-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-erff-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-erff-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-erff.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index b17bf78cd9..33d480031b 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -274,4 +274,15 @@ #define __DECL_SIMD_acoshf32x #define __DECL_SIMD_acoshf64x #define __DECL_SIMD_acoshf128x + +#define __DECL_SIMD_erf +#define __DECL_SIMD_erff +#define __DECL_SIMD_erfl +#define __DECL_SIMD_erff16 +#define __DECL_SIMD_erff32 +#define __DECL_SIMD_erff64 +#define __DECL_SIMD_erff128 +#define __DECL_SIMD_erff32x +#define __DECL_SIMD_erff64x +#define __DECL_SIMD_erff128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index bc37973c41..a5b6c4457f 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -228,7 +228,7 @@ __MATHCALL (yn,, (int, _Mdouble_)); #if defined __USE_XOPEN || defined __USE_ISOC99 /* Error and gamma functions. 
*/ -__MATHCALL (erf,, (_Mdouble_)); +__MATHCALL_VEC (erf,, (_Mdouble_)); __MATHCALL (erfc,, (_Mdouble_)); __MATHCALL (lgamma,, (_Mdouble_)); #endif diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index e9d6ade70a..5525c8a0d6 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -53,6 +53,7 @@ GLIBC_2.35 _ZGVbN2v_atan F GLIBC_2.35 _ZGVbN2v_atanh F GLIBC_2.35 _ZGVbN2v_cbrt F GLIBC_2.35 _ZGVbN2v_cosh F +GLIBC_2.35 _ZGVbN2v_erf F GLIBC_2.35 _ZGVbN2v_exp10 F GLIBC_2.35 _ZGVbN2v_exp2 F GLIBC_2.35 _ZGVbN2v_expm1 F @@ -69,6 +70,7 @@ GLIBC_2.35 _ZGVbN4v_atanf F GLIBC_2.35 _ZGVbN4v_atanhf F GLIBC_2.35 _ZGVbN4v_cbrtf F GLIBC_2.35 _ZGVbN4v_coshf F +GLIBC_2.35 _ZGVbN4v_erff F GLIBC_2.35 _ZGVbN4v_exp10f F GLIBC_2.35 _ZGVbN4v_exp2f F GLIBC_2.35 _ZGVbN4v_expm1f F @@ -85,6 +87,7 @@ GLIBC_2.35 _ZGVcN4v_atan F GLIBC_2.35 _ZGVcN4v_atanh F GLIBC_2.35 _ZGVcN4v_cbrt F GLIBC_2.35 _ZGVcN4v_cosh F +GLIBC_2.35 _ZGVcN4v_erf F GLIBC_2.35 _ZGVcN4v_exp10 F GLIBC_2.35 _ZGVcN4v_exp2 F GLIBC_2.35 _ZGVcN4v_expm1 F @@ -101,6 +104,7 @@ GLIBC_2.35 _ZGVcN8v_atanf F GLIBC_2.35 _ZGVcN8v_atanhf F GLIBC_2.35 _ZGVcN8v_cbrtf F GLIBC_2.35 _ZGVcN8v_coshf F +GLIBC_2.35 _ZGVcN8v_erff F GLIBC_2.35 _ZGVcN8v_exp10f F GLIBC_2.35 _ZGVcN8v_exp2f F GLIBC_2.35 _ZGVcN8v_expm1f F @@ -117,6 +121,7 @@ GLIBC_2.35 _ZGVdN4v_atan F GLIBC_2.35 _ZGVdN4v_atanh F GLIBC_2.35 _ZGVdN4v_cbrt F GLIBC_2.35 _ZGVdN4v_cosh F +GLIBC_2.35 _ZGVdN4v_erf F GLIBC_2.35 _ZGVdN4v_exp10 F GLIBC_2.35 _ZGVdN4v_exp2 F GLIBC_2.35 _ZGVdN4v_expm1 F @@ -133,6 +138,7 @@ GLIBC_2.35 _ZGVdN8v_atanf F GLIBC_2.35 _ZGVdN8v_atanhf F GLIBC_2.35 _ZGVdN8v_cbrtf F GLIBC_2.35 _ZGVdN8v_coshf F +GLIBC_2.35 _ZGVdN8v_erff F GLIBC_2.35 _ZGVdN8v_exp10f F GLIBC_2.35 _ZGVdN8v_exp2f F GLIBC_2.35 _ZGVdN8v_expm1f F @@ -149,6 +155,7 @@ GLIBC_2.35 _ZGVeN16v_atanf F GLIBC_2.35 _ZGVeN16v_atanhf F GLIBC_2.35 _ZGVeN16v_cbrtf F GLIBC_2.35 _ZGVeN16v_coshf F +GLIBC_2.35 
_ZGVeN16v_erff F GLIBC_2.35 _ZGVeN16v_exp10f F GLIBC_2.35 _ZGVeN16v_exp2f F GLIBC_2.35 _ZGVeN16v_expm1f F @@ -165,6 +172,7 @@ GLIBC_2.35 _ZGVeN8v_atan F GLIBC_2.35 _ZGVeN8v_atanh F GLIBC_2.35 _ZGVeN8v_cbrt F GLIBC_2.35 _ZGVeN8v_cosh F +GLIBC_2.35 _ZGVeN8v_erf F GLIBC_2.35 _ZGVeN8v_exp10 F GLIBC_2.35 _ZGVeN8v_exp2 F GLIBC_2.35 _ZGVeN8v_expm1 F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 4ad12a33e5..ea0deb31c1 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -122,6 +122,10 @@ # define __DECL_SIMD_acosh __DECL_SIMD_x86_64 # undef __DECL_SIMD_acoshf # define __DECL_SIMD_acoshf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_erf +# define __DECL_SIMD_erf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_erff +# define __DECL_SIMD_erff __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 503547d3e4..42addd9a25 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -60,6 +60,8 @@ !GCC$ builtin (atanhf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (acosh) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (acoshf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (erf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (erff) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -105,3 +107,5 @@ !GCC$ builtin (atanhf) attributes simd (notinbranch) if('x32') !GCC$ builtin (acosh) attributes simd (notinbranch) if('x32') !GCC$ builtin (acoshf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (erf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (erff) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 7b90b3d049..2b89a1bba3 
100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -31,6 +31,7 @@ libmvec-funcs = \ cbrt \ cos \ cosh \ + erf \ exp \ exp10 \ exp2 \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index fd5e5923a1..2fcdef6944 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -21,6 +21,7 @@ libmvec { _ZGVbN2v_atanh; _ZGVcN4v_atanh; _ZGVdN4v_atanh; _ZGVeN8v_atanh; _ZGVbN2v_cbrt; _ZGVcN4v_cbrt; _ZGVdN4v_cbrt; _ZGVeN8v_cbrt; _ZGVbN2v_cosh; _ZGVcN4v_cosh; _ZGVdN4v_cosh; _ZGVeN8v_cosh; + _ZGVbN2v_erf; _ZGVcN4v_erf; _ZGVdN4v_erf; _ZGVeN8v_erf; _ZGVbN2v_exp10; _ZGVcN4v_exp10; _ZGVdN4v_exp10; _ZGVeN8v_exp10; _ZGVbN2v_exp2; _ZGVcN4v_exp2; _ZGVdN4v_exp2; _ZGVeN8v_exp2; _ZGVbN2v_expm1; _ZGVcN4v_expm1; _ZGVdN4v_expm1; _ZGVeN8v_expm1; @@ -37,6 +38,7 @@ libmvec { _ZGVbN4v_atanhf; _ZGVcN8v_atanhf; _ZGVdN8v_atanhf; _ZGVeN16v_atanhf; _ZGVbN4v_cbrtf; _ZGVcN8v_cbrtf; _ZGVdN8v_cbrtf; _ZGVeN16v_cbrtf; _ZGVbN4v_coshf; _ZGVcN8v_coshf; _ZGVdN8v_coshf; _ZGVeN16v_coshf; + _ZGVbN4v_erff; _ZGVcN8v_erff; _ZGVdN8v_erff; _ZGVeN16v_erff; _ZGVbN4v_exp10f; _ZGVcN8v_exp10f; _ZGVdN8v_exp10f; _ZGVeN16v_exp10f; _ZGVbN4v_exp2f; _ZGVcN8v_exp2f; _ZGVdN8v_exp2f; _ZGVeN16v_exp2f; _ZGVbN4v_expm1f; _ZGVcN8v_expm1f; _ZGVdN8v_expm1f; _ZGVeN16v_expm1f; diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index b2aa8fc56e..929de0e786 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -1298,6 +1298,26 @@ float: 1 float128: 2 ldouble: 1 +Function: "erf_vlen16": +float: 1 + +Function: "erf_vlen2": +double: 1 + +Function: "erf_vlen4": +double: 1 +float: 2 + +Function: "erf_vlen4_avx2": +double: 1 + +Function: "erf_vlen8": +double: 1 +float: 2 + +Function: "erf_vlen8_avx2": +float: 2 + Function: "erfc": double: 5 float: 3 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core-sse2.S new file mode 100644 index 
0000000000..2b5735ebb3 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized erf, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVbN2v_erf _ZGVbN2v_erf_sse2 +#include "../svml_d_erf2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core.c new file mode 100644 index 0000000000..74757be88f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized erf, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details.
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVbN2v_erf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_erf, __GI__ZGVbN2v_erf, __redirect__ZGVbN2v_erf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core_sse4.S new file mode 100644 index 0000000000..c164748bbe --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf2_core_sse4.S @@ -0,0 +1,987 @@ +/* Function erf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Basic formula is + * erf(x) ~ erf(x0) + + * + exp(-x0*x0)*D*(1+c0+T*P1(T)+D^2*P3(T)+D^4*P5(T)+D^6*p7+D^8*p9) + * where D=x-x0, T=x0*D + * x0 is x rounded to a specified number of fractional bits (in this case 7), + * except that x0=0 for |x|<3.5/128.0 (using x0=0 for first 4 table entries) + * + * Data table packs both erf(x0)_high and a few bits of erf(x0)_low in one + * entry (in place of redundant exponent bits) + * + */ + +/* Offsets for data table __svml_derf_data_internal + */ +#define _erf_tbl 0 +#define _AbsMask 12288 +#define _MaxThreshold 12304 +#define _SRound 12320 +#define _U2Threshold 12336 +#define _poly1_0 12352 +#define _poly1_1 12368 +#define _poly3_0 12384 +#define _poly3_1 12400 +#define _poly5_0 12416 +#define _poly5_1 12432 +#define _poly1_2 12448 +#define _poly3_2 12464 +#define _poly1_3 12480 +#define _poly3_3 12496 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_erf_sse4) +/* + * vector gather: erf(x0), + * second value is exp(-x0*x0) + */ + lea __svml_derf_data_internal(%rip), %rcx + movups _AbsMask+__svml_derf_data_internal(%rip), %xmm5 + andps %xmm0, %xmm5 + +/* + * erf(x) rounds to 1.0 for x>_MaxThreshold (5.9921875) + * can compute all results in the main path + */ + movaps %xmm5, %xmm9 + +/* save sign */ + pxor %xmm5, %xmm0 + minpd _MaxThreshold+__svml_derf_data_internal(%rip), %xmm9 + movups _SRound+__svml_derf_data_internal(%rip), %xmm1 + movaps %xmm1, %xmm2 + addpd %xmm9, %xmm2 + movaps %xmm2, %xmm8 + psllq $4, %xmm2 + subpd %xmm1, %xmm8 + movd %xmm2, %eax + movups _U2Threshold+__svml_derf_data_internal(%rip), %xmm11 + cmpltpd %xmm9, %xmm11 + subpd %xmm8, %xmm9 + mulpd %xmm9, %xmm8 + +/* + * _LA_ polynomial computation + * Start polynomial evaluation + */ + movups _poly1_0+__svml_derf_data_internal(%rip), %xmm7 + andps %xmm9, %xmm11 + mulpd %xmm8, %xmm7 + +/* D2 = Diff^2 */ + mulpd %xmm11, %xmm11 + addpd _poly1_1+__svml_derf_data_internal(%rip), %xmm7 + +/* NaN
fixup */ + minpd %xmm5, %xmm9 + mulpd %xmm8, %xmm7 + movups _poly3_0+__svml_derf_data_internal(%rip), %xmm6 + +/* T^2 */ + movaps %xmm8, %xmm12 + mulpd %xmm8, %xmm6 + addpd _poly1_2+__svml_derf_data_internal(%rip), %xmm7 + addpd _poly3_1+__svml_derf_data_internal(%rip), %xmm6 + mulpd %xmm8, %xmm12 + mulpd %xmm8, %xmm6 + mulpd %xmm8, %xmm7 + addpd _poly3_2+__svml_derf_data_internal(%rip), %xmm6 + addpd _poly1_3+__svml_derf_data_internal(%rip), %xmm7 + mulpd %xmm8, %xmm6 + +/* P1 = T^2*P1 - T */ + mulpd %xmm7, %xmm12 + movups _poly5_0+__svml_derf_data_internal(%rip), %xmm10 + +/* Sign | Diff */ + pxor %xmm0, %xmm9 + mulpd %xmm8, %xmm10 + subpd %xmm8, %xmm12 + addpd _poly5_1+__svml_derf_data_internal(%rip), %xmm10 + mulpd %xmm11, %xmm10 + addpd _poly3_3+__svml_derf_data_internal(%rip), %xmm10 + addpd %xmm6, %xmm10 + pshufd $2, %xmm2, %xmm3 + movd %xmm3, %edx + +/* P1 + P3*D2 */ + mulpd %xmm10, %xmm11 + movslq %eax, %rax + movslq %edx, %rdx + addpd %xmm11, %xmm12 + movups (%rcx,%rax), %xmm13 + movups (%rcx,%rdx), %xmm4 + movaps %xmm13, %xmm14 + unpckhpd %xmm4, %xmm13 + +/* exp_h(x0) * Diff */ + mulpd %xmm9, %xmm13 + +/* + * branch-free + * low part of result: exp_h(x0) * Diff*(1+P1) + */ + mulpd %xmm13, %xmm12 + addpd %xmm12, %xmm13 + unpcklpd %xmm4, %xmm14 + +/* Sign | _Erf_H */ + pxor %xmm0, %xmm14 + +/* Final result */ + addpd %xmm13, %xmm14 + +/* Fix erf(-0) = -0 */ + orps %xmm14, %xmm0 + ret + +END(_ZGVbN2v_erf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_derf_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _erf_tbl[6*128*2][2]; + __declspec(align(16)) VUINT32 _AbsMask[2][2]; + __declspec(align(16)) VUINT32 _MaxThreshold[2][2]; + __declspec(align(16)) VUINT32 _SRound[2][2]; + __declspec(align(16)) VUINT32 _U2Threshold[2][2]; + __declspec(align(16)) VUINT32 _poly1_0[2][2]; + __declspec(align(16)) VUINT32 _poly1_1[2][2]; + __declspec(align(16)) VUINT32 _poly3_0[2][2]; + __declspec(align(16)) 
VUINT32 _poly3_1[2][2]; + __declspec(align(16)) VUINT32 _poly5_0[2][2]; + __declspec(align(16)) VUINT32 _poly5_1[2][2]; + __declspec(align(16)) VUINT32 _poly1_2[2][2]; + __declspec(align(16)) VUINT32 _poly3_2[2][2]; + __declspec(align(16)) VUINT32 _poly1_3[2][2]; + __declspec(align(16)) VUINT32 _poly3_3[2][2]; +} __svml_derf_data_internal; +#endif +__svml_derf_data_internal: + /*== _erf_tbl ==*/ + .quad 0x0000000000000000, 0x3ff20dd750429b6d + .quad 0x3f820dbf3deb1340, 0x3ff20d8f1975c85d + .quad 0x3f920d77083f17a0, 0x3ff20cb67bd452c7 + .quad 0x3f9b137e0cf584dc, 0x3ff20b4d8bac36c1 + .quad 0x3fa20c5645dd2538, 0x3ff209546ad13ccf + .quad 0x3fa68e5d3bbc9526, 0x3ff206cb4897b148 + .quad 0x3fab0fafef135745, 0x3ff203b261cd0053 + .quad 0x3faf902a77bd3821, 0x3ff2000a00ae3804 + .quad 0x3fb207d480e90658, 0x3ff1fbd27cdc72d3 + .quad 0x3fb44703e87e8593, 0x3ff1f70c3b4f2cc8 + .quad 0x3fb68591a1e83b5d, 0x3ff1f1b7ae44867f + .quad 0x3fb8c36beb8a8d23, 0x3ff1ebd5552f795b + .quad 0x3fbb0081148a873a, 0x3ff1e565bca400d4 + .quad 0x3fbd3cbf7e70a4b3, 0x3ff1de697e413d29 + .quad 0x3fbf78159ec8bb50, 0x3ff1d6e14099944a + .quad 0x3fc0d939005f65e5, 0x3ff1cecdb718d61c + .quad 0x3fc1f5e1a35c3b89, 0x3ff1c62fa1e869b6 + .quad 0x3fc311fc15f56d14, 0x3ff1bd07cdd189ac + .quad 0x3fc42d7fc2f64959, 0x3ff1b357141d95d5 + .quad 0x3fc548642321d7c6, 0x3ff1a91e5a748165 + .quad 0x3fc662a0bdf7a89f, 0x3ff19e5e92b964ab + .quad 0x3fc77c2d2a765f9e, 0x3ff19318bae53a04 + .quad 0x3fc895010fdbdbfd, 0x3ff1874ddcdfce24 + .quad 0x3fc9ad142662e14d, 0x3ff17aff0e56ec10 + .quad 0x3fcac45e37fe2526, 0x3ff16e2d7093cd8c + .quad 0x3fcbdad72110a648, 0x3ff160da304ed92f + .quad 0x3fccf076d1233237, 0x3ff153068581b781 + .quad 0x3fce05354b96ff36, 0x3ff144b3b337c90c + .quad 0x3fcf190aa85540e2, 0x3ff135e3075d076b + .quad 0x3fd015f78a3dcf3d, 0x3ff12695da8b5bde + .quad 0x3fd09eed6982b948, 0x3ff116cd8fd67618 + .quad 0x3fd127631eb8de32, 0x3ff1068b94962e5e + .quad 0x3fd1af54e232d609, 0x3ff0f5d1602f7e41 + .quad 0x3fd236bef825d9a2, 0x3ff0e4a073dc1b91 + 
.quad 0x3fd2bd9db0f7827f, 0x3ff0d2fa5a70c168 + .quad 0x3fd343ed6989b7d9, 0x3ff0c0e0a8223359 + .quad 0x3fd3c9aa8b84beda, 0x3ff0ae54fa490723 + .quad 0x3fd44ed18d9f6462, 0x3ff09b58f724416b + .quad 0x3fd4d35ef3e5372e, 0x3ff087ee4d9ad247 + .quad 0x3fd5574f4ffac98e, 0x3ff07416b4fbfe7c + .quad 0x3fd5da9f415ff23f, 0x3ff05fd3ecbec298 + .quad 0x3fd65d4b75b00471, 0x3ff04b27bc403d30 + .quad 0x3fd6df50a8dff772, 0x3ff03613f2812daf + .quad 0x3fd760aba57a76bf, 0x3ff0209a65e29545 + .quad 0x3fd7e15944d9d3e4, 0x3ff00abcf3e187a9 + .quad 0x3fd861566f5fd3c0, 0x3fefe8fb01a47307 + .quad 0x3fd8e0a01cab516b, 0x3fefbbbbef34b4b2 + .quad 0x3fd95f3353cbb146, 0x3fef8dc092d58ff8 + .quad 0x3fd9dd0d2b721f39, 0x3fef5f0cdaf15313 + .quad 0x3fda5a2aca209394, 0x3fef2fa4c16c0019 + .quad 0x3fdad68966569a87, 0x3feeff8c4b1375db + .quad 0x3fdb522646bbda68, 0x3feecec7870ebca8 + .quad 0x3fdbccfec24855b8, 0x3fee9d5a8e4c934e + .quad 0x3fdc4710406a65fc, 0x3fee6b4982f158b9 + .quad 0x3fdcc058392a6d2d, 0x3fee38988fc46e72 + .quad 0x3fdd38d4354c3bd0, 0x3fee054be79d3042 + .quad 0x3fddb081ce6e2a48, 0x3fedd167c4cf9d2a + .quad 0x3fde275eaf25e458, 0x3fed9cf06898cdaf + .quad 0x3fde9d68931ae650, 0x3fed67ea1a8b5368 + .quad 0x3fdf129d471eabb1, 0x3fed325927fb9d89 + .quad 0x3fdf86faa9428f9d, 0x3fecfc41e36c7df9 + .quad 0x3fdffa7ea8eb5fd0, 0x3fecc5a8a3fbea40 + .quad 0x3fe03693a371519c, 0x3fec8e91c4d01368 + .quad 0x3fe06f794ab2cae7, 0x3fec5701a484ef9d + .quad 0x3fe0a7ef5c18edd2, 0x3fec1efca49a5011 + .quad 0x3fe0dff4f247f6c6, 0x3febe68728e29d5e + .quad 0x3fe1178930ada115, 0x3febada596f25436 + .quad 0x3fe14eab43841b55, 0x3feb745c55905bf8 + .quad 0x3fe1855a5fd3dd50, 0x3feb3aafcc27502e + .quad 0x3fe1bb95c3746199, 0x3feb00a46237d5be + .quad 0x3fe1f15cb50bc4de, 0x3feac63e7ecc1411 + .quad 0x3fe226ae840d4d70, 0x3fea8b8287ec6a09 + .quad 0x3fe25b8a88b6dd7f, 0x3fea5074e2157620 + .quad 0x3fe28ff0240d52cd, 0x3fea1519efaf889e + .quad 0x3fe2c3debfd7d6c1, 0x3fe9d97610879642 + .quad 0x3fe2f755ce9a21f4, 0x3fe99d8da149c13f + .quad 0x3fe32a54cb8db67b, 
0x3fe96164fafd8de3 + .quad 0x3fe35cdb3a9a144d, 0x3fe925007283d7aa + .quad 0x3fe38ee8a84beb71, 0x3fe8e86458169af8 + .quad 0x3fe3c07ca9cb4f9e, 0x3fe8ab94f6caa71d + .quad 0x3fe3f196dcd0f135, 0x3fe86e9694134b9e + .quad 0x3fe42236e79a5fa6, 0x3fe8316d6f48133d + .quad 0x3fe4525c78dd5966, 0x3fe7f41dc12c9e89 + .quad 0x3fe4820747ba2dc2, 0x3fe7b6abbb7aaf19 + .quad 0x3fe4b13713ad3513, 0x3fe7791b886e7403 + .quad 0x3fe4dfeba47f63cc, 0x3fe73b714a552763 + .quad 0x3fe50e24ca35fd2c, 0x3fe6fdb11b1e0c34 + .quad 0x3fe53be25d016a4f, 0x3fe6bfdf0beddaf5 + .quad 0x3fe569243d2b3a9b, 0x3fe681ff24b4ab04 + .quad 0x3fe595ea53035283, 0x3fe6441563c665d4 + .quad 0x3fe5c2348ecc4dc3, 0x3fe60625bd75d07b + .quad 0x3fe5ee02e8a71a53, 0x3fe5c8341bb23767 + .quad 0x3fe61955607dd15d, 0x3fe58a445da7c74c + .quad 0x3fe6442bfdedd397, 0x3fe54c5a57629db0 + .quad 0x3fe66e86d0312e82, 0x3fe50e79d1749ac9 + .quad 0x3fe69865ee075011, 0x3fe4d0a6889dfd9f + .quad 0x3fe6c1c9759d0e5f, 0x3fe492e42d78d2c5 + .quad 0x3fe6eab18c74091b, 0x3fe4553664273d24 + .quad 0x3fe7131e5f496a5a, 0x3fe417a0c4049fd0 + .quad 0x3fe73b1021fc0cb8, 0x3fe3da26d759aef5 + .quad 0x3fe762870f720c6f, 0x3fe39ccc1b136d5a + .quad 0x3fe78983697dc96f, 0x3fe35f93fe7d1b3d + .quad 0x3fe7b00578c26037, 0x3fe32281e2fd1a92 + .quad 0x3fe7d60d8c979f7b, 0x3fe2e5991bd4cbfc + .quad 0x3fe7fb9bfaed8078, 0x3fe2a8dcede3673b + .quad 0x3fe820b1202f27fb, 0x3fe26c508f6bd0ff + .quad 0x3fe8454d5f25760d, 0x3fe22ff727dd6f7b + .quad 0x3fe8697120d92a4a, 0x3fe1f3d3cf9ffe5a + .quad 0x3fe88d1cd474a2e0, 0x3fe1b7e98fe26217 + .quad 0x3fe8b050ef253c37, 0x3fe17c3b626c7a12 + .quad 0x3fe8d30debfc572e, 0x3fe140cc3173f007 + .quad 0x3fe8f5544bd00c04, 0x3fe1059ed7740313 + .quad 0x3fe91724951b8fc6, 0x3fe0cab61f084b93 + .quad 0x3fe9387f53df5238, 0x3fe09014c2ca74da + .quad 0x3fe959651980da31, 0x3fe055bd6d32e8d7 + .quad 0x3fe979d67caa6631, 0x3fe01bb2b87c6968 + .quad 0x3fe999d4192a5715, 0x3fdfc3ee5d1524b0 + .quad 0x3fe9b95e8fd26aba, 0x3fdf511a91a67d2a + .quad 0x3fe9d8768656cc42, 0x3fdedeeee0959518 + 
.quad 0x3fe9f71ca72cffb6, 0x3fde6d6ffaa65a25 + .quad 0x3fea1551a16aaeaf, 0x3fddfca26f5bbf88 + .quad 0x3fea331628a45b92, 0x3fdd8c8aace11e63 + .quad 0x3fea506af4cc00f4, 0x3fdd1d2cfff91594 + .quad 0x3fea6d50c20fa293, 0x3fdcae8d93f1d7b7 + .quad 0x3fea89c850b7d54d, 0x3fdc40b0729ed548 + .quad 0x3feaa5d265064366, 0x3fdbd3998457afdb + .quad 0x3feac16fc7143263, 0x3fdb674c8ffc6283 + .quad 0x3feadca142b10f98, 0x3fdafbcd3afe8ab6 + .quad 0x3feaf767a741088b, 0x3fda911f096fbc26 + .quad 0x3feb11c3c79bb424, 0x3fda27455e14c93c + .quad 0x3feb2bb679ead19c, 0x3fd9be437a7de946 + .quad 0x3feb4540978921ee, 0x3fd9561c7f23a47b + .quad 0x3feb5e62fce16095, 0x3fd8eed36b886d93 + .quad 0x3feb771e894d602e, 0x3fd8886b1e5ecfd1 + .quad 0x3feb8f741ef54f83, 0x3fd822e655b417e7 + .quad 0x3feba764a2af2b78, 0x3fd7be47af1f5d89 + .quad 0x3febbef0fbde6221, 0x3fd75a91a7f4d2ed + .quad 0x3febd61a1453ab44, 0x3fd6f7c69d7d3ef8 + .quad 0x3febece0d82d1a5c, 0x3fd695e8cd31867e + .quad 0x3fec034635b66e23, 0x3fd634fa54fa285f + .quad 0x3fec194b1d49a184, 0x3fd5d4fd33729015 + .quad 0x3fec2ef0812fc1bd, 0x3fd575f3483021c3 + .quad 0x3fec443755820d64, 0x3fd517de540ce2a3 + .quad 0x3fec5920900b5fd1, 0x3fd4babff975a04c + .quad 0x3fec6dad2829ec62, 0x3fd45e99bcbb7915 + .quad 0x3fec81de16b14cef, 0x3fd4036d0468a7a2 + .quad 0x3fec95b455cce69d, 0x3fd3a93b1998736c + .quad 0x3feca930e0e2a825, 0x3fd35005285227f1 + .quad 0x3fecbc54b476248d, 0x3fd2f7cc3fe6f423 + .quad 0x3feccf20ce0c0d27, 0x3fd2a09153529381 + .quad 0x3fece1962c0e0d8b, 0x3fd24a55399ea239 + .quad 0x3fecf3b5cdaf0c39, 0x3fd1f518ae487dc8 + .quad 0x3fed0580b2cfd249, 0x3fd1a0dc51a9934d + .quad 0x3fed16f7dbe41ca0, 0x3fd14da0a961fd14 + .quad 0x3fed281c49d818d0, 0x3fd0fb6620c550af + .quad 0x3fed38eefdf64fdd, 0x3fd0aa2d09497f2b + .quad 0x3fed4970f9ce00d9, 0x3fd059f59af7a906 + .quad 0x3fed59a33f19ed42, 0x3fd00abff4dec7a3 + .quad 0x3fed6986cfa798e7, 0x3fcf79183b101c5b + .quad 0x3fed791cad3eff01, 0x3fcedeb406d9c825 + .quad 0x3fed8865d98abe01, 0x3fce4652fadcb6b2 + .quad 0x3fed97635600bb89, 
0x3fcdaff4969c0b04 + .quad 0x3feda61623cb41e0, 0x3fcd1b982c501370 + .quad 0x3fedb47f43b2980d, 0x3fcc893ce1dcbef7 + .quad 0x3fedc29fb60715af, 0x3fcbf8e1b1ca2279 + .quad 0x3fedd0787a8bb39d, 0x3fcb6a856c3ed54f + .quad 0x3fedde0a90611a0d, 0x3fcade26b7fbed95 + .quad 0x3fedeb56f5f12d28, 0x3fca53c4135a6526 + .quad 0x3fedf85ea8db188e, 0x3fc9cb5bd549b111 + .quad 0x3fee0522a5dfda73, 0x3fc944ec2e4f5630 + .quad 0x3fee11a3e8cf4eb8, 0x3fc8c07329874652 + .quad 0x3fee1de36c75ba58, 0x3fc83deeada4d25a + .quad 0x3fee29e22a89d766, 0x3fc7bd5c7df3fe9c + .quad 0x3fee35a11b9b61ce, 0x3fc73eba3b5b07b7 + .quad 0x3fee4121370224cc, 0x3fc6c205655be720 + .quad 0x3fee4c6372cd8927, 0x3fc6473b5b15a7a1 + .quad 0x3fee5768c3b4a3fc, 0x3fc5ce595c455b0a + .quad 0x3fee62321d06c5e0, 0x3fc5575c8a468362 + .quad 0x3fee6cc0709c8a0d, 0x3fc4e241e912c305 + .quad 0x3fee7714aec96534, 0x3fc46f066040a832 + .quad 0x3fee812fc64db369, 0x3fc3fda6bc016994 + .quad 0x3fee8b12a44944a8, 0x3fc38e1fae1d6a9d + .quad 0x3fee94be342e6743, 0x3fc3206dceef5f87 + .quad 0x3fee9e335fb56f87, 0x3fc2b48d9e5dea1c + .quad 0x3feea7730ed0bbb9, 0x3fc24a7b84d38971 + .quad 0x3feeb07e27a133aa, 0x3fc1e233d434b813 + .quad 0x3feeb9558e6b42ce, 0x3fc17bb2c8d41535 + .quad 0x3feec1fa258c4bea, 0x3fc116f48a6476cc + .quad 0x3feeca6ccd709544, 0x3fc0b3f52ce8c383 + .quad 0x3feed2ae6489ac1e, 0x3fc052b0b1a174ea + .quad 0x3feedabfc7453e63, 0x3fbfe6460fef4680 + .quad 0x3feee2a1d004692c, 0x3fbf2a901ccafb37 + .quad 0x3feeea5557137ae0, 0x3fbe723726b824a9 + .quad 0x3feef1db32a2277c, 0x3fbdbd32ac4c99b0 + .quad 0x3feef93436bc2daa, 0x3fbd0b7a0f921e7c + .quad 0x3fef006135426b26, 0x3fbc5d0497c09e74 + .quad 0x3fef0762fde45ee6, 0x3fbbb1c972f23e50 + .quad 0x3fef0e3a5e1a1788, 0x3fbb09bfb7d11a84 + .quad 0x3fef14e8211e8c55, 0x3fba64de673e8837 + .quad 0x3fef1b6d0fea5f4d, 0x3fb9c31c6df3b1b8 + .quad 0x3fef21c9f12f0677, 0x3fb92470a61b6965 + .quad 0x3fef27ff89525acf, 0x3fb888d1d8e510a3 + .quad 0x3fef2e0e9a6a8b09, 0x3fb7f036c0107294 + .quad 0x3fef33f7e43a706b, 0x3fb75a96077274ba + 
.quad 0x3fef39bc242e43e6, 0x3fb6c7e64e7281cb + .quad 0x3fef3f5c1558b19e, 0x3fb6381e2980956b + .quad 0x3fef44d870704911, 0x3fb5ab342383d178 + .quad 0x3fef4a31ebcd47df, 0x3fb5211ebf41880b + .quad 0x3fef4f693b67bd77, 0x3fb499d478bca735 + .quad 0x3fef547f10d60597, 0x3fb4154bc68d75c3 + .quad 0x3fef59741b4b97cf, 0x3fb3937b1b31925a + .quad 0x3fef5e4907982a07, 0x3fb31458e6542847 + .quad 0x3fef62fe80272419, 0x3fb297db960e4f63 + .quad 0x3fef67952cff6282, 0x3fb21df9981f8e53 + .quad 0x3fef6c0db3c34641, 0x3fb1a6a95b1e786f + .quad 0x3fef7068b7b10fd9, 0x3fb131e14fa1625d + .quad 0x3fef74a6d9a38383, 0x3fb0bf97e95f2a64 + .quad 0x3fef78c8b812d498, 0x3fb04fc3a0481321 + .quad 0x3fef7cceef15d631, 0x3fafc4b5e32d6259 + .quad 0x3fef80ba18636f07, 0x3faeeea8c1b1db94 + .quad 0x3fef848acb544e95, 0x3fae1d4cf1e2450a + .quad 0x3fef88419ce4e184, 0x3fad508f9a1ea64f + .quad 0x3fef8bdf1fb78370, 0x3fac885df3451a07 + .quad 0x3fef8f63e416ebff, 0x3fabc4a54a84e834 + .quad 0x3fef92d077f8d56d, 0x3fab055303221015 + .quad 0x3fef96256700da8e, 0x3faa4a549829587e + .quad 0x3fef99633a838a57, 0x3fa993979e14fffe + .quad 0x3fef9c8a7989af0d, 0x3fa8e109c4622913 + .quad 0x3fef9f9ba8d3c733, 0x3fa83298d717210e + .quad 0x3fefa2974addae45, 0x3fa78832c03aa2b1 + .quad 0x3fefa57ddfe27376, 0x3fa6e1c5893c380b + .quad 0x3fefa84fe5e05c8d, 0x3fa63f3f5c4de13b + .quad 0x3fefab0dd89d1309, 0x3fa5a08e85af27e0 + .quad 0x3fefadb831a9f9c3, 0x3fa505a174e9c929 + .quad 0x3fefb04f6868a944, 0x3fa46e66be002240 + .quad 0x3fefb2d3f20f9101, 0x3fa3dacd1a8d8cce + .quad 0x3fefb54641aebbc9, 0x3fa34ac36ad8dafe + .quad 0x3fefb7a6c834b5a2, 0x3fa2be38b6d92415 + .quad 0x3fefb9f5f4739170, 0x3fa2351c2f2d1449 + .quad 0x3fefbc3433260ca5, 0x3fa1af5d2e04f3f6 + .quad 0x3fefbe61eef4cf6a, 0x3fa12ceb37ff9bc3 + .quad 0x3fefc07f907bc794, 0x3fa0adb5fcfa8c75 + .quad 0x3fefc28d7e4f9cd0, 0x3fa031ad58d56279 + .quad 0x3fefc48c1d033c7a, 0x3f9f7182a851bca2 + .quad 0x3fefc67bcf2d7b8f, 0x3f9e85c449e377f3 + .quad 0x3fefc85cf56ecd38, 0x3f9da0005e5f28df + .quad 0x3fefca2fee770c79, 
0x3f9cc0180af00a8b + .quad 0x3fefcbf5170b578b, 0x3f9be5ecd2fcb5f9 + .quad 0x3fefcdacca0bfb73, 0x3f9b1160991ff737 + .quad 0x3fefcf57607a6e7c, 0x3f9a4255a00b9f03 + .quad 0x3fefd0f5317f582f, 0x3f9978ae8b55ce1b + .quad 0x3fefd2869270a56f, 0x3f98b44e6031383e + .quad 0x3fefd40bd6d7a785, 0x3f97f5188610ddc8 + .quad 0x3fefd58550773cb5, 0x3f973af0c737bb45 + .quad 0x3fefd6f34f52013a, 0x3f9685bb5134ef13 + .quad 0x3fefd85621b0876d, 0x3f95d55cb54cd53a + .quad 0x3fefd9ae142795e3, 0x3f9529b9e8cf9a1e + .quad 0x3fefdafb719e6a69, 0x3f9482b8455dc491 + .quad 0x3fefdc3e835500b3, 0x3f93e03d891b37de + .quad 0x3fefdd7790ea5bc0, 0x3f93422fd6d12e2b + .quad 0x3fefdea6e062d0c9, 0x3f92a875b5ffab56 + .quad 0x3fefdfccb62e52d3, 0x3f9212f612dee7fb + .quad 0x3fefe0e9552ebdd6, 0x3f9181983e5133dd + .quad 0x3fefe1fcfebe2083, 0x3f90f443edc5ce49 + .quad 0x3fefe307f2b503d0, 0x3f906ae13b0d3255 + .quad 0x3fefe40a6f70af4b, 0x3f8fcab1483ea7fc + .quad 0x3fefe504b1d9696c, 0x3f8ec72615a894c4 + .quad 0x3fefe5f6f568b301, 0x3f8dcaf3691fc448 + .quad 0x3fefe6e1742f7cf6, 0x3f8cd5ec93c12432 + .quad 0x3fefe7c466dc57a1, 0x3f8be7e5ac24963b + .quad 0x3fefe8a004c19ae6, 0x3f8b00b38d6b3575 + .quad 0x3fefe97483db8670, 0x3f8a202bd6372dce + .quad 0x3fefea4218d6594a, 0x3f894624e78e0faf + .quad 0x3fefeb08f7146046, 0x3f887275e3a6869e + .quad 0x3fefebc950b3fa75, 0x3f87a4f6aca256cb + .quad 0x3fefec835695932e, 0x3f86dd7fe3358230 + .quad 0x3fefed37386190fb, 0x3f861beae53b72b7 + .quad 0x3fefede5248e38f4, 0x3f856011cc3b036d + .quad 0x3fefee8d486585ee, 0x3f84a9cf6bda3f4c + .quad 0x3fefef2fd00af31a, 0x3f83f8ff5042a88e + .quad 0x3fefefcce6813974, 0x3f834d7dbc76d7e5 + .quad 0x3feff064b5afffbe, 0x3f82a727a89a3f14 + .quad 0x3feff0f766697c76, 0x3f8205dac02bd6b9 + .quad 0x3feff18520700971, 0x3f81697560347b26 + .quad 0x3feff20e0a7ba8c2, 0x3f80d1d69569b82d + .quad 0x3feff2924a3f7a83, 0x3f803ede1a45bfee + .quad 0x3feff312046f2339, 0x3f7f60d8aa2a88f2 + .quad 0x3feff38d5cc4227f, 0x3f7e4cc4abf7d065 + .quad 0x3feff404760319b4, 0x3f7d4143a9dfe965 + 
.quad 0x3feff47772010262, 0x3f7c3e1a5f5c077c + .quad 0x3feff4e671a85425, 0x3f7b430ecf4a83a8 + .quad 0x3feff55194fe19df, 0x3f7a4fe83fb9db25 + .quad 0x3feff5b8fb26f5f6, 0x3f79646f35a76624 + .quad 0x3feff61cc26c1578, 0x3f78806d70b2fc36 + .quad 0x3feff67d08401202, 0x3f77a3ade6c8b3e5 + .quad 0x3feff6d9e943c231, 0x3f76cdfcbfc1e263 + .quad 0x3feff733814af88c, 0x3f75ff2750fe7820 + .quad 0x3feff789eb6130c9, 0x3f7536fc18f7ce5c + .quad 0x3feff7dd41ce2b4d, 0x3f74754abacdf1dc + .quad 0x3feff82d9e1a76d8, 0x3f73b9e3f9d06e3f + .quad 0x3feff87b1913e853, 0x3f730499b503957f + .quad 0x3feff8c5cad200a5, 0x3f72553ee2a336bf + .quad 0x3feff90dcaba4096, 0x3f71aba78ba3af89 + .quad 0x3feff9532f846ab0, 0x3f7107a8c7323a6e + .quad 0x3feff9960f3eb327, 0x3f706918b6355624 + .quad 0x3feff9d67f51ddba, 0x3f6f9f9cfd9c3035 + .quad 0x3feffa14948549a7, 0x3f6e77448fb66bb9 + .quad 0x3feffa506302ebae, 0x3f6d58da68fd1170 + .quad 0x3feffa89fe5b3625, 0x3f6c4412bf4b8f0b + .quad 0x3feffac17988ef4b, 0x3f6b38a3af2e55b4 + .quad 0x3feffaf6e6f4f5c0, 0x3f6a3645330550ff + .quad 0x3feffb2a5879f35e, 0x3f693cb11a30d765 + .quad 0x3feffb5bdf67fe6f, 0x3f684ba3004a50d0 + .quad 0x3feffb8b8c88295f, 0x3f6762d84469c18f + .quad 0x3feffbb970200110, 0x3f66821000795a03 + .quad 0x3feffbe599f4f9d9, 0x3f65a90b00981d93 + .quad 0x3feffc10194fcb64, 0x3f64d78bba8ca5fd + .quad 0x3feffc38fcffbb7c, 0x3f640d564548fad7 + .quad 0x3feffc60535dd7f5, 0x3f634a305080681f + .quad 0x3feffc862a501fd7, 0x3f628de11c5031eb + .quad 0x3feffcaa8f4c9bea, 0x3f61d83170fbf6fb + .quad 0x3feffccd8f5c66d1, 0x3f6128eb96be8798 + .quad 0x3feffcef371ea4d7, 0x3f607fdb4dafea5f + .quad 0x3feffd0f92cb6ba7, 0x3f5fb99b8b8279e1 + .quad 0x3feffd2eae369a07, 0x3f5e7f232d9e2630 + .quad 0x3feffd4c94d29fdb, 0x3f5d4fed7195d7e8 + .quad 0x3feffd6951b33686, 0x3f5c2b9cf7f893bf + .quad 0x3feffd84ef9009ee, 0x3f5b11d702b3deb2 + .quad 0x3feffd9f78c7524a, 0x3f5a024365f771bd + .quad 0x3feffdb8f7605ee7, 0x3f58fc8c794b03b5 + .quad 0x3feffdd1750e1220, 0x3f58005f08d6f1ef + .quad 0x3feffde8fb314ebf, 
0x3f570d6a46e07dda + .quad 0x3feffdff92db56e5, 0x3f56235fbd7a4345 + .quad 0x3feffe1544d01ccb, 0x3f5541f340697987 + .quad 0x3feffe2a1988857c, 0x3f5468dadf4080ab + .quad 0x3feffe3e19349dc7, 0x3f5397ced7af2b15 + .quad 0x3feffe514bbdc197, 0x3f52ce898809244e + .quad 0x3feffe63b8c8b5f7, 0x3f520cc76202c5fb + .quad 0x3feffe7567b7b5e1, 0x3f515246dda49d47 + .quad 0x3feffe865fac722b, 0x3f509ec86c75d497 + .quad 0x3feffe96a78a04a9, 0x3f4fe41cd9bb4eee + .quad 0x3feffea645f6d6da, 0x3f4e97ba3b77f306 + .quad 0x3feffeb5415e7c44, 0x3f4d57f524723822 + .quad 0x3feffec39ff380b9, 0x3f4c245d4b99847a + .quad 0x3feffed167b12ac2, 0x3f4afc85e0f82e12 + .quad 0x3feffede9e5d3262, 0x3f49e005769dbc1d + .quad 0x3feffeeb49896c6d, 0x3f48ce75e9f6f8a0 + .quad 0x3feffef76e956a9f, 0x3f47c7744d9378f7 + .quad 0x3fefff0312b010b5, 0x3f46caa0d3582fe9 + .quad 0x3fefff0e3ad91ec2, 0x3f45d79eb71e893b + .quad 0x3fefff18ebe2b0e1, 0x3f44ee1429bf7cc0 + .quad 0x3fefff232a72b48e, 0x3f440daa3c89f5b6 + .quad 0x3fefff2cfb0453d9, 0x3f43360ccd23db3a + .quad 0x3fefff3661e9569d, 0x3f4266ea71d4f71a + .quad 0x3fefff3f634b79f9, 0x3f419ff4663ae9df + .quad 0x3fefff48032dbe40, 0x3f40e0de78654d1e + .quad 0x3fefff50456dab8c, 0x3f40295ef6591848 + .quad 0x3fefff582dc48d30, 0x3f3ef25d37f49fe1 + .quad 0x3fefff5fbfc8a439, 0x3f3da01102b5f851 + .quad 0x3fefff66feee5129, 0x3f3c5b5412dcafad + .quad 0x3fefff6dee89352e, 0x3f3b23a5a23e4210 + .quad 0x3fefff7491cd4af6, 0x3f39f8893d8fd1c1 + .quad 0x3fefff7aebcff755, 0x3f38d986a4187285 + .quad 0x3fefff80ff8911fd, 0x3f37c629a822bc9e + .quad 0x3fefff86cfd3e657, 0x3f36be02102b3520 + .quad 0x3fefff8c5f702ccf, 0x3f35c0a378c90bca + .quad 0x3fefff91b102fca8, 0x3f34cda5374ea275 + .quad 0x3fefff96c717b695, 0x3f33e4a23d1f4703 + .quad 0x3fefff9ba420e834, 0x3f330538fbb77ecd + .quad 0x3fefffa04a7928b1, 0x3f322f0b496539be + .quad 0x3fefffa4bc63ee9a, 0x3f3161be46ad3b50 + .quad 0x3fefffa8fc0e5f33, 0x3f309cfa445b00ff + .quad 0x3fefffad0b901755, 0x3f2fc0d55470cf51 + .quad 0x3fefffb0ecebee1b, 0x3f2e577bbcd49935 + 
.quad 0x3fefffb4a210b172, 0x3f2cfd4a5adec5c0 + .quad 0x3fefffb82cd9dcbf, 0x3f2bb1a9657ce465 + .quad 0x3fefffbb8f1049c6, 0x3f2a740684026555 + .quad 0x3fefffbeca6adbe9, 0x3f2943d4a1d1ed39 + .quad 0x3fefffc1e08f25f5, 0x3f28208bc334a6a5 + .quad 0x3fefffc4d3120aa1, 0x3f2709a8db59f25c + .quad 0x3fefffc7a37857d2, 0x3f25feada379d8b7 + .quad 0x3fefffca53375ce3, 0x3f24ff207314a102 + .quad 0x3fefffcce3b57bff, 0x3f240a8c1949f75e + .quad 0x3fefffcf564ab6b7, 0x3f23207fb7420eb9 + .quad 0x3fefffd1ac4135f9, 0x3f22408e9ba3327f + .quad 0x3fefffd3e6d5cd87, 0x3f216a501f0e42ca + .quad 0x3fefffd607387b07, 0x3f209d5f819c9e29 + .quad 0x3fefffd80e8ce0da, 0x3f1fb2b792b40a22 + .quad 0x3fefffd9fdeabcce, 0x3f1e3bcf436a1a95 + .quad 0x3fefffdbd65e5ad0, 0x3f1cd55277c18d05 + .quad 0x3fefffdd98e903b2, 0x3f1b7e94604479dc + .quad 0x3fefffdf46816833, 0x3f1a36eec00926dd + .quad 0x3fefffe0e0140857, 0x3f18fdc1b2dcf7b9 + .quad 0x3fefffe26683972a, 0x3f17d2737527c3f9 + .quad 0x3fefffe3daa95b18, 0x3f16b4702d7d5849 + .quad 0x3fefffe53d558ae9, 0x3f15a329b7d30748 + .quad 0x3fefffe68f4fa777, 0x3f149e17724f4d41 + .quad 0x3fefffe7d156d244, 0x3f13a4b60ba9aa4e + .quad 0x3fefffe904222101, 0x3f12b6875310f785 + .quad 0x3fefffea2860ee1e, 0x3f11d312098e9dba + .quad 0x3fefffeb3ebb267b, 0x3f10f9e1b4dd36df + .quad 0x3fefffec47d19457, 0x3f102a8673a94692 + .quad 0x3fefffed443e2787, 0x3f0ec929a665b449 + .quad 0x3fefffee34943b15, 0x3f0d4f4b4c8e09ed + .quad 0x3fefffef1960d85d, 0x3f0be6abbb10a5aa + .quad 0x3fefffeff32af7af, 0x3f0a8e8cc1fadef6 + .quad 0x3feffff0c273bea2, 0x3f094637d5bacfdb + .quad 0x3feffff187b6bc0e, 0x3f080cfdc72220cf + .quad 0x3feffff2436a21dc, 0x3f06e2367dc27f95 + .quad 0x3feffff2f5fefcaa, 0x3f05c540b4936fd2 + .quad 0x3feffff39fe16963, 0x3f04b581b8d170fc + .quad 0x3feffff44178c8d2, 0x3f03b2652b06c2b2 + .quad 0x3feffff4db27f146, 0x3f02bb5cc22e5db6 + .quad 0x3feffff56d4d5e5e, 0x3f01cfe010e2052d + .quad 0x3feffff5f8435efc, 0x3f00ef6c4c84a0fe + .quad 0x3feffff67c604180, 0x3f001984165a5f36 + .quad 0x3feffff6f9f67e55, 
0x3efe9b5e8d00ce77 + .quad 0x3feffff77154e0d6, 0x3efd16f5716c6c1a + .quad 0x3feffff7e2c6aea2, 0x3efba4f035d60e03 + .quad 0x3feffff84e93cd75, 0x3efa447b7b03f045 + .quad 0x3feffff8b500e77c, 0x3ef8f4ccca7fc90d + .quad 0x3feffff9164f8e46, 0x3ef7b5223dac7336 + .quad 0x3feffff972be5c59, 0x3ef684c227fcacef + .quad 0x3feffff9ca891572, 0x3ef562fac4329b48 + .quad 0x3feffffa1de8c582, 0x3ef44f21e49054f2 + .quad 0x3feffffa6d13de73, 0x3ef34894a5e24657 + .quad 0x3feffffab83e54b8, 0x3ef24eb7254ccf83 + .quad 0x3feffffaff99bac4, 0x3ef160f438c70913 + .quad 0x3feffffb43555b5f, 0x3ef07ebd2a2d2844 + .quad 0x3feffffb839e52f3, 0x3eef4f12e9ab070a + .quad 0x3feffffbc09fa7cd, 0x3eedb5ad0b27805c + .quad 0x3feffffbfa82616b, 0x3eec304efa2c6f4e + .quad 0x3feffffc316d9ed0, 0x3eeabe09e9144b5e + .quad 0x3feffffc6586abf6, 0x3ee95df988e76644 + .quad 0x3feffffc96f1165e, 0x3ee80f439b4ee04b + .quad 0x3feffffcc5cec0c1, 0x3ee6d11788a69c64 + .quad 0x3feffffcf23ff5fc, 0x3ee5a2adfa0b4bc4 + .quad 0x3feffffd1c637b2b, 0x3ee4834877429b8f + .quad 0x3feffffd4456a10d, 0x3ee37231085c7d9a + .quad 0x3feffffd6a3554a1, 0x3ee26eb9daed6f7e + .quad 0x3feffffd8e1a2f22, 0x3ee1783ceac28910 + .quad 0x3feffffdb01e8546, 0x3ee08e1badf0fced + .quad 0x3feffffdd05a75ea, 0x3edf5f7d88472604 + .quad 0x3feffffdeee4f810, 0x3eddb92b5212fb8d + .quad 0x3feffffe0bd3e852, 0x3edc282cd3957eda + .quad 0x3feffffe273c15b7, 0x3edaab7abace48dc + .quad 0x3feffffe41314e06, 0x3ed94219bfcb4928 + .quad 0x3feffffe59c6698b, 0x3ed7eb1a2075864e + .quad 0x3feffffe710d565e, 0x3ed6a597219a93da + .quad 0x3feffffe8717232d, 0x3ed570b69502f313 + .quad 0x3feffffe9bf4098c, 0x3ed44ba864670882 + .quad 0x3feffffeafb377d5, 0x3ed335a62115bce2 + .quad 0x3feffffec2641a9e, 0x3ed22df298214423 + .quad 0x3feffffed413e5b7, 0x3ed133d96ae7e0dd + .quad 0x3feffffee4d01cd6, 0x3ed046aeabcfcdec + .quad 0x3feffffef4a55bd4, 0x3ececb9cfe1d8642 + .quad 0x3fefffff039f9e8f, 0x3ecd21397ead99cb + .quad 0x3fefffff11ca4876, 0x3ecb8d094c86d374 + .quad 0x3fefffff1f302bc1, 0x3eca0df0f0c626dc + 
 .quad 0x3fefffff2bdb904d, 0x3ec8a2e269750a39
+ .quad 0x3fefffff37d63a36, 0x3ec74adc8f4064d3
+ .quad 0x3fefffff43297019, 0x3ec604ea819f007c
+ .quad 0x3fefffff4dde0118, 0x3ec4d0231928c6f9
+ .quad 0x3fefffff57fc4a95, 0x3ec3aba85fe22e20
+ .quad 0x3fefffff618c3da6, 0x3ec296a70f414053
+ .quad 0x3fefffff6a956450, 0x3ec1905613b3abf2
+ .quad 0x3fefffff731ee681, 0x3ec097f6156f32c5
+ .quad 0x3fefffff7b2f8ed6, 0x3ebf59a20caf6695
+ .quad 0x3fefffff82cdcf1b, 0x3ebd9c73698fb1dc
+ .quad 0x3fefffff89ffc4aa, 0x3ebbf716c6168bae
+ .quad 0x3fefffff90cb3c81, 0x3eba6852c6b58392
+ .quad 0x3fefffff9735b73b, 0x3eb8eefd70594a89
+ .quad 0x3fefffff9d446ccc, 0x3eb789fb715aae95
+ .quad 0x3fefffffa2fc5015, 0x3eb6383f726a8e04
+ .quad 0x3fefffffa8621251, 0x3eb4f8c96f26a26a
+ .quad 0x3fefffffad7a2652, 0x3eb3caa61607f920
+ .quad 0x3fefffffb248c39d, 0x3eb2acee2f5ecdb8
+ .quad 0x3fefffffb6d1e95d, 0x3eb19ec60b1242ed
+ .quad 0x3fefffffbb196132, 0x3eb09f5cf4dd2877
+ .quad 0x3fefffffbf22c1e2, 0x3eaf5bd95d8730d8
+ .quad 0x3fefffffc2f171e3, 0x3ead9371e2ff7c35
+ .quad 0x3fefffffc688a9cf, 0x3eabe41de54d155a
+ .quad 0x3fefffffc9eb76ac, 0x3eaa4c89e08ef4f3
+ .quad 0x3fefffffcd1cbc28, 0x3ea8cb738399b12c
+ .quad 0x3fefffffd01f36af, 0x3ea75fa8dbc84bec
+ .quad 0x3fefffffd2f57d68, 0x3ea608078a70dcbc
+ .quad 0x3fefffffd5a2041f, 0x3ea4c37c0394d094
+ .quad 0x3fefffffd8271d12, 0x3ea39100d5687bfe
+ .quad 0x3fefffffda86faa9, 0x3ea26f9df8519bd7
+ .quad 0x3fefffffdcc3b117, 0x3ea15e6827001f18
+ .quad 0x3fefffffdedf37ed, 0x3ea05c803e4831c1
+ .quad 0x3fefffffe0db6b91, 0x3e9ed22548cffd35
+ .quad 0x3fefffffe2ba0ea5, 0x3e9d06ad6ecdf971
+ .quad 0x3fefffffe47ccb60, 0x3e9b551c847fbc96
+ .quad 0x3fefffffe62534d4, 0x3e99bc09f112b494
+ .quad 0x3fefffffe7b4c81e, 0x3e983a1ff0aa239d
+ .quad 0x3fefffffe92ced93, 0x3e96ce1aa3fd7bdd
+ .quad 0x3fefffffea8ef9cf, 0x3e9576c72b514859
+ .quad 0x3fefffffebdc2ec6, 0x3e943302cc4a0da8
+ .quad 0x3fefffffed15bcba, 0x3e9301ba221dc9bb
+ .quad 0x3fefffffee3cc32c, 0x3e91e1e857adc568
+ .quad 0x3fefffffef5251c2, 0x3e90d2966b1746f7
+ .quad 0x3feffffff0576917, 0x3e8fa5b4f49cc6b2
+ .quad 0x3feffffff14cfb92, 0x3e8dc3ae30b55c16
+ .quad 0x3feffffff233ee1d, 0x3e8bfd7555a3bd68
+ .quad 0x3feffffff30d18e8, 0x3e8a517d9e61628a
+ .quad 0x3feffffff3d9480f, 0x3e88be4f8f6c951f
+ .quad 0x3feffffff4993c46, 0x3e874287ded49339
+ .quad 0x3feffffff54dab72, 0x3e85dcd669f2cd34
+ .quad 0x3feffffff5f74141, 0x3e848bfd38302871
+ .quad 0x3feffffff6969fb8, 0x3e834ecf8a3c124a
+ .quad 0x3feffffff72c5fb6, 0x3e822430f521cbcf
+ .quad 0x3feffffff7b91176, 0x3e810b1488aeb235
+ .quad 0x3feffffff83d3d07, 0x3e80027c00a263a6
+ .quad 0x3feffffff8b962be, 0x3e7e12ee004efc37
+ .quad 0x3feffffff92dfba2, 0x3e7c3e44ae32b16b
+ .quad 0x3feffffff99b79d2, 0x3e7a854ea14102a8
+ .quad 0x3feffffffa0248e8, 0x3e78e6761569f45d
+ .quad 0x3feffffffa62ce54, 0x3e77603bac345f65
+ .quad 0x3feffffffabd69b4, 0x3e75f1353cdad001
+ .quad 0x3feffffffb127525, 0x3e74980cb3c80949
+ .quad 0x3feffffffb624592, 0x3e73537f00b6ad4d
+ .quad 0x3feffffffbad2aff, 0x3e72225b12bffc68
+ .quad 0x3feffffffbf370cd, 0x3e710380e1adb7e9
+ .quad 0x3feffffffc355dfd, 0x3e6febc107d5efaa
+ .quad 0x3feffffffc733572, 0x3e6df0f2a0ee6947
+ .quad 0x3feffffffcad3626, 0x3e6c14b2188bcee4
+ .quad 0x3feffffffce39b67, 0x3e6a553644f7f07d
+ .quad 0x3feffffffd169d0c, 0x3e68b0cfce0579e0
+ .quad 0x3feffffffd466fa5, 0x3e6725e7c5dd20f7
+ .quad 0x3feffffffd7344aa, 0x3e65b2fe547a1340
+ .quad 0x3feffffffd9d4aab, 0x3e6456a974e92e93
+ .quad 0x3feffffffdc4ad7a, 0x3e630f93c3699078
+ .quad 0x3feffffffde9964e, 0x3e61dc7b5b978cf8
+ .quad 0x3feffffffe0c2bf0, 0x3e60bc30c5d52f15
+ .quad 0x3feffffffe2c92db, 0x3e5f5b2be65a0c7f
+ .quad 0x3feffffffe4aed5e, 0x3e5d5f3a8dea7357
+ .quad 0x3feffffffe675bbd, 0x3e5b82915b03515b
+ .quad 0x3feffffffe81fc4e, 0x3e59c3517e789488
+ .quad 0x3feffffffe9aeb97, 0x3e581fb7df06136e
+ .quad 0x3feffffffeb24467, 0x3e56961b8d641d06
+ .quad 0x3feffffffec81ff2, 0x3e5524ec4d916cae
+ .quad 0x3feffffffedc95e7, 0x3e53cab1343d18d1
+ .quad 0x3feffffffeefbc85, 0x3e52860757487a01
+ .quad 0x3fefffffff01a8b6, 0x3e5155a09065d4f7
+ .quad 0x3fefffffff126e1e, 0x3e50384250e4c9fc
+ .quad 0x3fefffffff221f30, 0x3e4e59890b926c78
+ .quad 0x3fefffffff30cd3f, 0x3e4c642116a8a9e3
+ .quad 0x3fefffffff3e8892, 0x3e4a8e405e651ab6
+ .quad 0x3fefffffff4b606f, 0x3e48d5f98114f872
+ .quad 0x3fefffffff57632d, 0x3e47397c5a66e307
+ .quad 0x3fefffffff629e44, 0x3e45b71456c5a4c4
+ .quad 0x3fefffffff6d1e56, 0x3e444d26de513197
+ .quad 0x3fefffffff76ef3f, 0x3e42fa31d6371537
+ .quad 0x3fefffffff801c1f, 0x3e41bcca373b7b43
+ .quad 0x3fefffffff88af67, 0x3e40939ab853339f
+ .quad 0x3fefffffff90b2e3, 0x3e3efac5187b2863
+ .quad 0x3fefffffff982fc1, 0x3e3cf1e86235d0e7
+ .quad 0x3fefffffff9f2e9f, 0x3e3b0a68a2128bab
+ .quad 0x3fefffffffa5b790, 0x3e39423165bc4444
+ .quad 0x3fefffffffabd229, 0x3e37974e743dea3d
+ .quad 0x3fefffffffb18582, 0x3e3607e9eacd1050
+ .quad 0x3fefffffffb6d844, 0x3e34924a74dec729
+ .quad 0x3fefffffffbbd0aa, 0x3e3334d19e0c2160
+ .quad 0x3fefffffffc0748f, 0x3e31edfa3c5f5cca
+ .quad 0x3fefffffffc4c96c, 0x3e30bc56f1b54701
+ .quad 0x3fefffffffc8d462, 0x3e2f3d2185e047d9
+ .quad 0x3fefffffffcc9a41, 0x3e2d26cb87945e87
+ .quad 0x3fefffffffd01f89, 0x3e2b334fac4b9f99
+ .quad 0x3fefffffffd36871, 0x3e296076f7918d1c
+ .quad 0x3fefffffffd678ed, 0x3e27ac2d72fc2c63
+ .quad 0x3fefffffffd954ae, 0x3e2614801550319e
+ .quad 0x3fefffffffdbff2a, 0x3e24979ac8b28927
+ .quad 0x3fefffffffde7ba0, 0x3e2333c68e2d0548
+ .quad 0x3fefffffffe0cd16, 0x3e21e767bce37dd7
+ .quad 0x3fefffffffe2f664, 0x3e20b0fc5b6d05a0
+ .quad 0x3fefffffffe4fa30, 0x3e1f1e3523b41d7d
+ .quad 0x3fefffffffe6daf7, 0x3e1d00de6608effe
+ .quad 0x3fefffffffe89b0c, 0x3e1b0778b7b3301b
+ .quad 0x3fefffffffea3c9a, 0x3e192fb04ec0f6cf
+ .quad 0x3fefffffffebc1a9, 0x3e177756ec9f78fa
+ .quad 0x3fefffffffed2c21, 0x3e15dc61922d5a06
+ .quad 0x3fefffffffee7dc8, 0x3e145ce65699ff6d
+ .quad 0x3fefffffffefb847, 0x3e12f71a5f159970
+ .quad 0x3feffffffff0dd2b, 0x3e11a94ff571654f
+ .quad 0x3feffffffff1ede9, 0x3e1071f4bbea09ec
+ .quad 0x3feffffffff2ebda, 0x3e0e9f1ff8ddd774
+ .quad 0x3feffffffff3d843, 0x3e0c818223a202c7
+ .quad 0x3feffffffff4b453, 0x3e0a887bd2b4404d
+ .quad 0x3feffffffff58126, 0x3e08b1a336c5eb6b
+ .quad 0x3feffffffff63fc3, 0x3e06fab63324088a
+ .quad 0x3feffffffff6f121, 0x3e056197e30205ba
+ .quad 0x3feffffffff79626, 0x3e03e44e45301b92
+ .quad 0x3feffffffff82fab, 0x3e0281000bfe4c3f
+ .quad 0x3feffffffff8be77, 0x3e0135f28f2d50b4
+ .quad 0x3feffffffff94346, 0x3e000187dded5975
+ .quad 0x3feffffffff9bec8, 0x3dfdc479de0ef001
+ .quad 0x3feffffffffa319f, 0x3dfbad4fdad3caa1
+ .quad 0x3feffffffffa9c63, 0x3df9baed3ed27ab8
+ .quad 0x3feffffffffaffa4, 0x3df7ead9ce4285bb
+ .quad 0x3feffffffffb5be5, 0x3df63ac6b4edc88e
+ .quad 0x3feffffffffbb1a2, 0x3df4a88be2a6390c
+ .quad 0x3feffffffffc014e, 0x3df332259185f1a0
+ .quad 0x3feffffffffc4b56, 0x3df1d5b1f3793044
+ .quad 0x3feffffffffc901c, 0x3df0916f04b6e18b
+ .quad 0x3feffffffffccfff, 0x3deec77101de6926
+ .quad 0x3feffffffffd0b56, 0x3dec960bf23153e0
+ .quad 0x3feffffffffd4271, 0x3dea8bd20fc65ef7
+ .quad 0x3feffffffffd759d, 0x3de8a61745ec7d1d
+ .quad 0x3feffffffffda520, 0x3de6e25d0e756261
+ .quad 0x3feffffffffdd13c, 0x3de53e4f7d1666cb
+ .quad 0x3feffffffffdfa2d, 0x3de3b7c27a7ddb0e
+ .quad 0x3feffffffffe202d, 0x3de24caf2c32af14
+ .quad 0x3feffffffffe4371, 0x3de0fb3186804d0f
+ .quad 0x3feffffffffe642a, 0x3ddf830c0bb41fd7
+ .quad 0x3feffffffffe8286, 0x3ddd3c0f1a91c846
+ .quad 0x3feffffffffe9eb0, 0x3ddb1e5acf351d87
+ .quad 0x3feffffffffeb8d0, 0x3dd92712d259ce66
+ .quad 0x3feffffffffed10a, 0x3dd7538c60a04476
+ .quad 0x3feffffffffee782, 0x3dd5a14b04b47879
+ .quad 0x3feffffffffefc57, 0x3dd40dfd87456f4c
+ .quad 0x3fefffffffff0fa7, 0x3dd2977b1172b9d5
+ .quad 0x3fefffffffff218f, 0x3dd13bc07e891491
+ .quad 0x3fefffffffff3227, 0x3dcff1dbb4300811
+ .quad 0x3fefffffffff4188, 0x3dcd9a880f306bd8
+ .quad 0x3fefffffffff4fc9, 0x3dcb6e45220b55e0
+ .quad 0x3fefffffffff5cfd, 0x3dc96a0b33f2c4da
+ .quad 0x3fefffffffff6939, 0x3dc78b07e9e924ac
+ .quad 0x3fefffffffff748e, 0x3dc5ce9ab1670dd2
+ .quad 0x3fefffffffff7f0d, 0x3dc4325167006bb0
+ .quad 0x3fefffffffff88c5, 0x3dc2b3e53538ff3f
+ .quad 0x3fefffffffff91c6, 0x3dc15137a7f44864
+ .quad 0x3fefffffffff9a1b, 0x3dc0084ff125639d
+ .quad 0x3fefffffffffa1d2, 0x3dbdaeb0b7311ec7
+ .quad 0x3fefffffffffa8f6, 0x3dbb7937d1c40c53
+ .quad 0x3fefffffffffaf92, 0x3db96d082f59ab06
+ .quad 0x3fefffffffffb5b0, 0x3db7872d9fa10aad
+ .quad 0x3fefffffffffbb58, 0x3db5c4e8e37bc7d0
+ .quad 0x3fefffffffffc095, 0x3db423ac0df49a40
+ .quad 0x3fefffffffffc56d, 0x3db2a117230ad284
+ .quad 0x3fefffffffffc9e8, 0x3db13af4f04f9998
+ .quad 0x3fefffffffffce0d, 0x3dafde703724e560
+ .quad 0x3fefffffffffd1e1, 0x3dad77f0c82e7641
+ .quad 0x3fefffffffffd56c, 0x3dab3ee02611d7dd
+ .quad 0x3fefffffffffd8b3, 0x3da92ff33023d5bd
+ .quad 0x3fefffffffffdbba, 0x3da7481a9e69f53f
+ .quad 0x3fefffffffffde86, 0x3da5847eda620959
+ .quad 0x3fefffffffffe11d, 0x3da3e27c1fcc74bd
+ .quad 0x3fefffffffffe380, 0x3da25f9ee0b923dc
+ .quad 0x3fefffffffffe5b6, 0x3da0f9a068653200
+ .quad 0x3fefffffffffe7c0, 0x3d9f5cc7718082b0
+ .quad 0x3fefffffffffe9a2, 0x3d9cf7e53d6a2ca5
+ .quad 0x3fefffffffffeb60, 0x3d9ac0f5f3229372
+ .quad 0x3fefffffffffecfb, 0x3d98b498644847ea
+ .quad 0x3fefffffffffee77, 0x3d96cfa9bcca59dc
+ .quad 0x3fefffffffffefd6, 0x3d950f411d4fd2cd
+ .quad 0x3feffffffffff11a, 0x3d9370ab8327af5e
+ .quad 0x3feffffffffff245, 0x3d91f167f88c6b6e
+ .quad 0x3feffffffffff359, 0x3d908f24085d4597
+ .quad 0x3feffffffffff457, 0x3d8e8f70e181d61a
+ .quad 0x3feffffffffff542, 0x3d8c324c20e337dc
+ .quad 0x3feffffffffff61b, 0x3d8a03261574b54e
+ .quad 0x3feffffffffff6e3, 0x3d87fe903cdf5855
+ .quad 0x3feffffffffff79b, 0x3d86215c58da3450
+ .quad 0x3feffffffffff845, 0x3d846897d4b69fc6
+ .quad 0x3feffffffffff8e2, 0x3d82d1877d731b7b
+ .quad 0x3feffffffffff973, 0x3d8159a386b11517
+ .quad 0x3feffffffffff9f8, 0x3d7ffd27ae9393ce
+ .quad 0x3feffffffffffa73, 0x3d7d7c593130dd0b
+ .quad 0x3feffffffffffae4, 0x3d7b2cd607c79bcf
+ .quad 0x3feffffffffffb4c, 0x3d790ae4d3405651
+ .quad 0x3feffffffffffbad, 0x3d771312dd1759e2
+ .quad 0x3feffffffffffc05, 0x3d75422ef5d8949d
+ .quad 0x3feffffffffffc57, 0x3d739544b0ecc957
+ .quad 0x3feffffffffffca2, 0x3d720997f73e73dd
+ .quad 0x3feffffffffffce7, 0x3d709ca0eaacd277
+ .quad 0x3feffffffffffd27, 0x3d6e9810295890ec
+ .quad 0x3feffffffffffd62, 0x3d6c2b45b5aa4a1d
+ .quad 0x3feffffffffffd98, 0x3d69eee068fa7596
+ .quad 0x3feffffffffffdca, 0x3d67df2b399c10a8
+ .quad 0x3feffffffffffdf8, 0x3d65f8b87a31bd85
+ .quad 0x3feffffffffffe22, 0x3d64385c96e9a2d9
+ .quad 0x3feffffffffffe49, 0x3d629b2933ef4cbc
+ .quad 0x3feffffffffffe6c, 0x3d611e68a6378f8a
+ .quad 0x3feffffffffffe8d, 0x3d5f7f338086a86b
+ .quad 0x3feffffffffffeab, 0x3d5cf8d7d9ce040a
+ .quad 0x3feffffffffffec7, 0x3d5aa577251ae485
+ .quad 0x3feffffffffffee1, 0x3d58811d739efb5f
+ .quad 0x3feffffffffffef8, 0x3d568823e52970be
+ .quad 0x3fefffffffffff0e, 0x3d54b72ae68e8b4c
+ .quad 0x3fefffffffffff22, 0x3d530b14dbe876bc
+ .quad 0x3fefffffffffff34, 0x3d5181012ef86610
+ .quad 0x3fefffffffffff45, 0x3d501647ba798745
+ .quad 0x3fefffffffffff54, 0x3d4d90e917701675
+ .quad 0x3fefffffffffff62, 0x3d4b2a87e86d0c8a
+ .quad 0x3fefffffffffff6f, 0x3d48f53dcb377293
+ .quad 0x3fefffffffffff7b, 0x3d46ed2f2515e933
+ .quad 0x3fefffffffffff86, 0x3d450ecc9ed47f19
+ .quad 0x3fefffffffffff90, 0x3d4356cd5ce7799e
+ .quad 0x3fefffffffffff9a, 0x3d41c229a587ab78
+ .quad 0x3fefffffffffffa2, 0x3d404e15ecc7f3f6
+ .quad 0x3fefffffffffffaa, 0x3d3deffc7e6a6017
+ .quad 0x3fefffffffffffb1, 0x3d3b7b040832f310
+ .quad 0x3fefffffffffffb8, 0x3d3938e021f36d76
+ .quad 0x3fefffffffffffbe, 0x3d37258610b3b233
+ .quad 0x3fefffffffffffc3, 0x3d353d3bfc82a909
+ .quad 0x3fefffffffffffc8, 0x3d337c92babdc2fd
+ .quad 0x3fefffffffffffcd, 0x3d31e06010120f6a
+ .quad 0x3fefffffffffffd1, 0x3d3065b9616170d4
+ .quad 0x3fefffffffffffd5, 0x3d2e13dd96b3753b
+ .quad 0x3fefffffffffffd9, 0x3d2b950d32467392
+ .quad 0x3fefffffffffffdc, 0x3d294a72263259a5
+ .quad 0x3fefffffffffffdf, 0x3d272fd93e036cdc
+ .quad 0x3fefffffffffffe2, 0x3d254164576929ab
+ .quad 0x3fefffffffffffe4, 0x3d237b83c521fe96
+ .quad 0x3fefffffffffffe7, 0x3d21daf033182e96
+ .quad 0x3fefffffffffffe9, 0x3d205ca50205d26a
+ .quad 0x3fefffffffffffeb, 0x3d1dfbb6235639fa
+ .quad 0x3fefffffffffffed, 0x3d1b7807e294781f
+ .quad 0x3fefffffffffffee, 0x3d19298add70a734
+ .quad 0x3feffffffffffff0, 0x3d170beaf9c7ffb6
+ .quad 0x3feffffffffffff1, 0x3d151b2cd6709222
+ .quad 0x3feffffffffffff3, 0x3d1353a6cf7f7fff
+ .quad 0x3feffffffffffff4, 0x3d11b1fa8cbe84a7
+ .quad 0x3feffffffffffff5, 0x3d10330f0fd69921
+ .quad 0x3feffffffffffff6, 0x3d0da81670f96f9b
+ .quad 0x3feffffffffffff7, 0x3d0b24a16b4d09aa
+ .quad 0x3feffffffffffff7, 0x3d08d6eeb6efdbd6
+ .quad 0x3feffffffffffff8, 0x3d06ba91ac734786
+ .quad 0x3feffffffffffff9, 0x3d04cb7966770ab5
+ .quad 0x3feffffffffffff9, 0x3d0305e9721d0981
+ .quad 0x3feffffffffffffa, 0x3d01667311fff70a
+ .quad 0x3feffffffffffffb, 0x3cffd3de10d62855
+ .quad 0x3feffffffffffffb, 0x3cfd1aefbcd48d0c
+ .quad 0x3feffffffffffffb, 0x3cfa9cc93c25aca9
+ .quad 0x3feffffffffffffc, 0x3cf85487ee3ea735
+ .quad 0x3feffffffffffffc, 0x3cf63daf8b4b1e0c
+ .quad 0x3feffffffffffffd, 0x3cf45421e69a6ca1
+ .quad 0x3feffffffffffffd, 0x3cf294175802d99a
+ .quad 0x3feffffffffffffd, 0x3cf0fa17bf41068f
+ .quad 0x3feffffffffffffd, 0x3cef05e82aae2bb9
+ .quad 0x3feffffffffffffe, 0x3cec578101b29058
+ .quad 0x3feffffffffffffe, 0x3ce9e39dc5dd2f7c
+ .quad 0x3feffffffffffffe, 0x3ce7a553a728bbf2
+ .quad 0x3feffffffffffffe, 0x3ce5982008db1304
+ .quad 0x3feffffffffffffe, 0x3ce3b7e00422e51b
+ .quad 0x3feffffffffffffe, 0x3ce200c898d9ee3e
+ .quad 0x3fefffffffffffff, 0x3ce06f5f7eb65a56
+ .quad 0x3fefffffffffffff, 0x3cde00e9148a1d25
+ .quad 0x3fefffffffffffff, 0x3cdb623734024e92
+ .quad 0x3fefffffffffffff, 0x3cd8fd4e01891bf8
+ .quad 0x3fefffffffffffff, 0x3cd6cd44c7470d89
+ .quad 0x3fefffffffffffff, 0x3cd4cd9c04158cd7
+ .quad 0x3fefffffffffffff, 0x3cd2fa34bf5c8344
+ .quad 0x3fefffffffffffff, 0x3cd14f4890ff2461
+ .quad 0x3fefffffffffffff, 0x3ccf92c49dfa4df5
+ .quad 0x3fefffffffffffff, 0x3ccccaaea71ab0df
+ .quad 0x3fefffffffffffff, 0x3cca40829f001197
+ .quad 0x3ff0000000000000, 0x3cc7eef13b59e96c
+ .quad 0x3ff0000000000000, 0x3cc5d11e1a252bf5
+ .quad 0x3ff0000000000000, 0x3cc3e296303b2297
+ .quad 0x3ff0000000000000, 0x3cc21f47009f43ce
+ .quad 0x3ff0000000000000, 0x3cc083768c5e4542
+ .quad 0x3ff0000000000000, 0x3cbe1777d831265f
+ .quad 0x3ff0000000000000, 0x3cbb69f10b0191b5
+ .quad 0x3ff0000000000000, 0x3cb8f8a3a05b5b53
+ .quad 0x3ff0000000000000, 0x3cb6be573c40c8e7
+ .quad 0x3ff0000000000000, 0x3cb4b645ba991fdb
+ .align 16
+ .quad 0x7fffffffffffffff, 0x7fffffffffffffff /* _AbsMask */
+ .align 16
+ .quad 0x4017f80000000000, 0x4017f80000000000 /* _MaxThreshold = 6.0 - 1.0/128.0 */
+ .align 16
+ .quad 0x42c0000000000000, 0x42c0000000000000 /* SRound */
+ .align 16
+ .quad 0x2ff0000000000000, 0x2ff0000000000000 /* _U2Threshold */
+ .align 16
+ .quad 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5 /* _poly_1_0 */
+ .align 16
+ .quad 0x3fc1111235a363b1, 0x3fc1111235a363b1 /* _poly_1_1 */
+ .align 16
+ .quad 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57 /* _poly_3_0 */
+ .align 16
+ .quad 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8 /* _poly_3_1 */
+ .align 16
+ .quad 0xbfc5555800001B4F, 0xbfc5555800001B4F /* _poly_5_0 */
+ .align 16
+ .quad 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122 /* _poly_5_1 */
+ .align 16
+ .quad 0xbfd55555555547f6, 0xbfd55555555547f6 /* _poly_1_2 */
+ .align 16
+ .quad 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd /* _poly_3_2 */
+ .align 16
+ .quad 0x3fe5555555554b0c, 0x3fe5555555554b0c /* _poly_1_3 */
+ .align 16
+ .quad 0xbfd5555555555555, 0xbfd5555555555555 /* _poly_3_3 */
+ .align 16
+ .type __svml_derf_data_internal,@object
+ .size __svml_derf_data_internal,.-__svml_derf_data_internal
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core-sse.S
new file mode 100644
index 0000000000..704785738f
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core-sse.S
@@ -0,0 +1,20 @@
+/* SSE version of vectorized erf, vector length is 4.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#define _ZGVdN4v_erf _ZGVdN4v_erf_sse_wrapper
+#include "../svml_d_erf4_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core.c
new file mode 100644
index 0000000000..0647917209
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized erf, vector length is 4.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#define SYMBOL_NAME _ZGVdN4v_erf
+#include "ifunc-mathvec-avx2.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVdN4v_erf, __GI__ZGVdN4v_erf, __redirect__ZGVdN4v_erf)
+ __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core_avx2.S
new file mode 100644
index 0000000000..bd7226cd5c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf4_core_avx2.S
@@ -0,0 +1,984 @@
+/* Function erf vectorized with AVX2.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ https://www.gnu.org/licenses/. */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ * Basic formula is
+ * erf(x) ~ erf(x0) +
+ * + exp(-x0*x0)*D*(1+c0+T*P1(T)+D^2*P3(T)+D^4*P5(T)+D^6*p7+D^8*p9)
+ * where D=x-x0, T=x0*D
+ * x0 is x rounded to a specified number of fractional bits (in this case 7),
+ * except that x0=0 for |x|<3.5/128.0 (using x0=0 for first 4 table entries)
+ *
+ * Data table packs both erf(x0)_high and a few bits of erf(x0)_low in one
+ * entry (in place of redundant exponent bits)
+ *
+ */
+
+/* Offsets for data table __svml_derf_data_internal
+ */
+#define _erf_tbl 0
+#define _AbsMask 12288
+#define _MaxThreshold 12320
+#define _SRound 12352
+#define _U2Threshold 12384
+#define _poly1_0 12416
+#define _poly1_1 12448
+#define _poly3_0 12480
+#define _poly3_1 12512
+#define _poly5_0 12544
+#define _poly5_1 12576
+#define _poly1_2 12608
+#define _poly3_2 12640
+#define _poly1_3 12672
+#define _poly3_3 12704
+#define _Mask32 12736
+
+#include <sysdep.h>
+
+ .text
+ .section .text.avx2,"ax",@progbits
+ENTRY(_ZGVdN4v_erf_avx2)
+/*
+ * vector gather: erf(x0),
+ * second value is exp(-x0*x0)
+ */
+ lea __svml_derf_data_internal(%rip), %rdi
+ vmovupd _SRound+__svml_derf_data_internal(%rip), %ymm6
+ vandpd _AbsMask+__svml_derf_data_internal(%rip), %ymm0, %ymm5
+
+/*
+ * erf(x) rounds to 1.0 for x>_MaxThreshold (5.9921875)
+ * can compute all results in the main path
+ */
+ vminpd _MaxThreshold+__svml_derf_data_internal(%rip), %ymm5, %ymm7
+ vaddpd %ymm6, %ymm7, %ymm10
+ vcmpgt_oqpd _U2Threshold+__svml_derf_data_internal(%rip), %ymm7, %ymm9
+ vpsllq $4, %ymm10, %ymm11
+ vsubpd %ymm6, %ymm10, %ymm8
+ vandps _Mask32+__svml_derf_data_internal(%rip), %ymm11, %ymm12
+ vsubpd %ymm8, %ymm7, %ymm3
+ vmulpd %ymm3, %ymm8, %ymm2
+ vandpd %ymm9, %ymm3, %ymm1
+
+/* NaN fixup */
+ vminpd %ymm5, %ymm3, %ymm3
+
+/* save sign */
+ vxorpd %ymm0, %ymm5, %ymm4
+
+/* T^2 */
+ vmulpd %ymm2, %ymm2, %ymm5
+ vextractf128 $1, %ymm12, %xmm13
+ vmovd %xmm12, %eax
+ vmovd %xmm13, %ecx
+ vpextrd $2, %xmm12, %edx
+ vpextrd $2, %xmm13, %esi
+ movslq %eax, %rax
+ movslq %edx, %rdx
+ movslq %ecx, %rcx
+ movslq %esi, %rsi
+
+/* Sign | Diff */
+ vxorpd %ymm4, %ymm3, %ymm12
+
+/*
+ * _LA_ polynomial computation
+ * Start polynomial evaluation
+ */
+ vmovupd _poly1_0+__svml_derf_data_internal(%rip), %ymm3
+ vmovupd (%rdi,%rax), %xmm6
+ vmovupd (%rdi,%rdx), %xmm7
+ vmovupd (%rdi,%rcx), %xmm8
+ vmovupd (%rdi,%rsi), %xmm9
+ vunpcklpd %xmm7, %xmm6, %xmm14
+ vunpcklpd %xmm9, %xmm8, %xmm15
+
+/* D2 = Diff^2 */
+ vmulpd %ymm1, %ymm1, %ymm13
+ vfmadd213pd _poly1_1+__svml_derf_data_internal(%rip), %ymm2, %ymm3
+ vmovupd _poly5_0+__svml_derf_data_internal(%rip), %ymm1
+ vunpckhpd %xmm9, %xmm8, %xmm10
+ vfmadd213pd _poly1_2+__svml_derf_data_internal(%rip), %ymm2, %ymm3
+ vfmadd213pd _poly5_1+__svml_derf_data_internal(%rip), %ymm2, %ymm1
+ vfmadd213pd _poly1_3+__svml_derf_data_internal(%rip), %ymm2, %ymm3
+ vfmadd213pd _poly3_3+__svml_derf_data_internal(%rip), %ymm13, %ymm1
+
+/* P1 = T^2*P1 - T */
+ vfmsub213pd %ymm2, %ymm5, %ymm3
+ vinsertf128 $1, %xmm15, %ymm14, %ymm0
+ vunpckhpd %xmm7, %xmm6, %xmm14
+ vmovupd _poly3_0+__svml_derf_data_internal(%rip), %ymm6
+ vfmadd213pd _poly3_1+__svml_derf_data_internal(%rip), %ymm2, %ymm6
+ vfmadd213pd _poly3_2+__svml_derf_data_internal(%rip), %ymm2, %ymm6
+ vfmadd213pd %ymm1, %ymm2, %ymm6
+
+/* P1 + P3*D2 */
+ vfmadd213pd %ymm3, %ymm13, %ymm6
+
+/* Sign | _Erf_H */
+ vxorpd %ymm4, %ymm0, %ymm0
+ vinsertf128 $1, %xmm10, %ymm14, %ymm11
+
+/* exp_h(x0) * Diff */
+ vmulpd %ymm12, %ymm11, %ymm2
+
+/*
+ * branch-free
+ * low part of result: exp_h(x0) * Diff*(1+P1)
+ */
+ vfmadd213pd %ymm2, %ymm2, %ymm6
+
+/* Final result */
+ vaddpd %ymm6, %ymm0, %ymm15
+
+/* Fix erf(-0) = -0 */
+ vorpd %ymm4, %ymm15, %ymm0
+ ret
+
+END(_ZGVdN4v_erf_avx2)
+
+ .section .rodata, "a"
+ .align 32
+
+#ifdef __svml_derf_data_internal_typedef
+typedef unsigned int VUINT32;
+typedef struct
+{
+ __declspec(align(32)) VUINT32 _erf_tbl[6*128*2][2];
+ __declspec(align(32)) VUINT32 _AbsMask[4][2];
+ __declspec(align(32)) VUINT32 _MaxThreshold[4][2];
+ __declspec(align(32)) VUINT32 _SRound[4][2];
+ __declspec(align(32)) VUINT32 _U2Threshold[4][2];
+ __declspec(align(32)) VUINT32 _poly1_0[4][2];
+ __declspec(align(32)) VUINT32 _poly1_1[4][2];
+ __declspec(align(32)) VUINT32 _poly3_0[4][2];
+ __declspec(align(32)) VUINT32 _poly3_1[4][2];
+ __declspec(align(32)) VUINT32 _poly5_0[4][2];
+ __declspec(align(32)) VUINT32 _poly5_1[4][2];
+ __declspec(align(32)) VUINT32 _poly1_2[4][2];
+ __declspec(align(32)) VUINT32 _poly3_2[4][2];
+ __declspec(align(32)) VUINT32 _poly1_3[4][2];
+ __declspec(align(32)) VUINT32 _poly3_3[4][2];
+ __declspec(align(32)) VUINT32 _Mask32[4][2];
+} __svml_derf_data_internal;
+#endif
+__svml_derf_data_internal:
+ /*== _erf_tbl ==*/
+ .quad 0x0000000000000000, 0x3ff20dd750429b6d
+ .quad 0x3f820dbf3deb1340, 0x3ff20d8f1975c85d
+ .quad 0x3f920d77083f17a0, 0x3ff20cb67bd452c7
+ .quad 0x3f9b137e0cf584dc, 0x3ff20b4d8bac36c1
+ .quad 0x3fa20c5645dd2538, 0x3ff209546ad13ccf
+ .quad 0x3fa68e5d3bbc9526, 0x3ff206cb4897b148
+ .quad 0x3fab0fafef135745, 0x3ff203b261cd0053
+ .quad 0x3faf902a77bd3821, 0x3ff2000a00ae3804
+ .quad 0x3fb207d480e90658, 0x3ff1fbd27cdc72d3
+ .quad 0x3fb44703e87e8593, 0x3ff1f70c3b4f2cc8
+ .quad 0x3fb68591a1e83b5d, 0x3ff1f1b7ae44867f
+ .quad 0x3fb8c36beb8a8d23, 0x3ff1ebd5552f795b
+ .quad 0x3fbb0081148a873a, 0x3ff1e565bca400d4
+ .quad 0x3fbd3cbf7e70a4b3, 0x3ff1de697e413d29
+ .quad 0x3fbf78159ec8bb50, 0x3ff1d6e14099944a
+ .quad 0x3fc0d939005f65e5, 0x3ff1cecdb718d61c
+ .quad 0x3fc1f5e1a35c3b89, 0x3ff1c62fa1e869b6
+ .quad 0x3fc311fc15f56d14, 0x3ff1bd07cdd189ac
+ .quad 0x3fc42d7fc2f64959, 0x3ff1b357141d95d5
+ .quad 0x3fc548642321d7c6, 0x3ff1a91e5a748165
+ .quad 0x3fc662a0bdf7a89f, 0x3ff19e5e92b964ab
+ .quad 0x3fc77c2d2a765f9e, 0x3ff19318bae53a04
+ .quad 0x3fc895010fdbdbfd, 0x3ff1874ddcdfce24
+ .quad 0x3fc9ad142662e14d, 0x3ff17aff0e56ec10
+ .quad 0x3fcac45e37fe2526, 0x3ff16e2d7093cd8c
+ .quad 0x3fcbdad72110a648, 0x3ff160da304ed92f
+ .quad 0x3fccf076d1233237, 0x3ff153068581b781
+ .quad 0x3fce05354b96ff36, 0x3ff144b3b337c90c
+ .quad 0x3fcf190aa85540e2, 0x3ff135e3075d076b
+ .quad 0x3fd015f78a3dcf3d, 0x3ff12695da8b5bde
+ .quad 0x3fd09eed6982b948, 0x3ff116cd8fd67618
+ .quad 0x3fd127631eb8de32, 0x3ff1068b94962e5e
+ .quad 0x3fd1af54e232d609, 0x3ff0f5d1602f7e41
+ .quad 0x3fd236bef825d9a2, 0x3ff0e4a073dc1b91
+ .quad 0x3fd2bd9db0f7827f, 0x3ff0d2fa5a70c168
+ .quad 0x3fd343ed6989b7d9, 0x3ff0c0e0a8223359
+ .quad 0x3fd3c9aa8b84beda, 0x3ff0ae54fa490723
+ .quad 0x3fd44ed18d9f6462, 0x3ff09b58f724416b
+ .quad 0x3fd4d35ef3e5372e, 0x3ff087ee4d9ad247
+ .quad 0x3fd5574f4ffac98e, 0x3ff07416b4fbfe7c
+ .quad 0x3fd5da9f415ff23f, 0x3ff05fd3ecbec298
+ .quad 0x3fd65d4b75b00471, 0x3ff04b27bc403d30
+ .quad 0x3fd6df50a8dff772, 0x3ff03613f2812daf
+ .quad 0x3fd760aba57a76bf, 0x3ff0209a65e29545
+ .quad 0x3fd7e15944d9d3e4, 0x3ff00abcf3e187a9
+ .quad 0x3fd861566f5fd3c0, 0x3fefe8fb01a47307
+ .quad 0x3fd8e0a01cab516b, 0x3fefbbbbef34b4b2
+ .quad 0x3fd95f3353cbb146, 0x3fef8dc092d58ff8
+ .quad 0x3fd9dd0d2b721f39, 0x3fef5f0cdaf15313
+ .quad 0x3fda5a2aca209394, 0x3fef2fa4c16c0019
+ .quad 0x3fdad68966569a87, 0x3feeff8c4b1375db
+ .quad 0x3fdb522646bbda68, 0x3feecec7870ebca8
+ .quad 0x3fdbccfec24855b8, 0x3fee9d5a8e4c934e
+ .quad 0x3fdc4710406a65fc, 0x3fee6b4982f158b9
+ .quad 0x3fdcc058392a6d2d, 0x3fee38988fc46e72
+ .quad 0x3fdd38d4354c3bd0, 0x3fee054be79d3042
+ .quad 0x3fddb081ce6e2a48, 0x3fedd167c4cf9d2a
+ .quad 0x3fde275eaf25e458, 0x3fed9cf06898cdaf
+ .quad 0x3fde9d68931ae650, 0x3fed67ea1a8b5368
+ .quad 0x3fdf129d471eabb1, 0x3fed325927fb9d89
+ .quad 0x3fdf86faa9428f9d, 0x3fecfc41e36c7df9
+ .quad 0x3fdffa7ea8eb5fd0, 0x3fecc5a8a3fbea40
+ .quad 0x3fe03693a371519c, 0x3fec8e91c4d01368
+ .quad 0x3fe06f794ab2cae7, 0x3fec5701a484ef9d
+ .quad 0x3fe0a7ef5c18edd2, 0x3fec1efca49a5011
+ .quad 0x3fe0dff4f247f6c6, 0x3febe68728e29d5e
+ .quad 0x3fe1178930ada115, 0x3febada596f25436
+ .quad 0x3fe14eab43841b55, 0x3feb745c55905bf8
+ .quad 0x3fe1855a5fd3dd50, 0x3feb3aafcc27502e
+ .quad 0x3fe1bb95c3746199, 0x3feb00a46237d5be
+ .quad 0x3fe1f15cb50bc4de, 0x3feac63e7ecc1411
+ .quad 0x3fe226ae840d4d70, 0x3fea8b8287ec6a09
+ .quad 0x3fe25b8a88b6dd7f, 0x3fea5074e2157620
+ .quad 0x3fe28ff0240d52cd, 0x3fea1519efaf889e
+ .quad 0x3fe2c3debfd7d6c1, 0x3fe9d97610879642
+ .quad 0x3fe2f755ce9a21f4, 0x3fe99d8da149c13f
+ .quad 0x3fe32a54cb8db67b, 0x3fe96164fafd8de3
+ .quad 0x3fe35cdb3a9a144d, 0x3fe925007283d7aa
+ .quad 0x3fe38ee8a84beb71, 0x3fe8e86458169af8
+ .quad 0x3fe3c07ca9cb4f9e, 0x3fe8ab94f6caa71d
+ .quad 0x3fe3f196dcd0f135, 0x3fe86e9694134b9e
+ .quad 0x3fe42236e79a5fa6, 0x3fe8316d6f48133d
+ .quad 0x3fe4525c78dd5966, 0x3fe7f41dc12c9e89
+ .quad 0x3fe4820747ba2dc2, 0x3fe7b6abbb7aaf19
+ .quad 0x3fe4b13713ad3513, 0x3fe7791b886e7403
+ .quad 0x3fe4dfeba47f63cc, 0x3fe73b714a552763
+ .quad 0x3fe50e24ca35fd2c, 0x3fe6fdb11b1e0c34
+ .quad 0x3fe53be25d016a4f, 0x3fe6bfdf0beddaf5
+ .quad 0x3fe569243d2b3a9b, 0x3fe681ff24b4ab04
+ .quad 0x3fe595ea53035283, 0x3fe6441563c665d4
+ .quad 0x3fe5c2348ecc4dc3, 0x3fe60625bd75d07b
+ .quad 0x3fe5ee02e8a71a53, 0x3fe5c8341bb23767
+ .quad 0x3fe61955607dd15d, 0x3fe58a445da7c74c
+ .quad 0x3fe6442bfdedd397, 0x3fe54c5a57629db0
+ .quad 0x3fe66e86d0312e82, 0x3fe50e79d1749ac9
+ .quad 0x3fe69865ee075011, 0x3fe4d0a6889dfd9f
+ .quad 0x3fe6c1c9759d0e5f, 0x3fe492e42d78d2c5
+ .quad 0x3fe6eab18c74091b, 0x3fe4553664273d24
+ .quad 0x3fe7131e5f496a5a, 0x3fe417a0c4049fd0
+ .quad 0x3fe73b1021fc0cb8, 0x3fe3da26d759aef5
+ .quad 0x3fe762870f720c6f, 0x3fe39ccc1b136d5a
+ .quad 0x3fe78983697dc96f, 0x3fe35f93fe7d1b3d
+ .quad 0x3fe7b00578c26037, 0x3fe32281e2fd1a92
+ .quad 0x3fe7d60d8c979f7b, 0x3fe2e5991bd4cbfc
+ .quad 0x3fe7fb9bfaed8078, 0x3fe2a8dcede3673b
+ .quad 0x3fe820b1202f27fb, 0x3fe26c508f6bd0ff
+ .quad 0x3fe8454d5f25760d, 0x3fe22ff727dd6f7b
+ .quad 0x3fe8697120d92a4a, 0x3fe1f3d3cf9ffe5a
+ .quad 0x3fe88d1cd474a2e0, 0x3fe1b7e98fe26217
+ .quad 0x3fe8b050ef253c37, 0x3fe17c3b626c7a12
+ .quad 0x3fe8d30debfc572e, 0x3fe140cc3173f007
+ .quad 0x3fe8f5544bd00c04, 0x3fe1059ed7740313
+ .quad 0x3fe91724951b8fc6, 0x3fe0cab61f084b93
+ .quad 0x3fe9387f53df5238, 0x3fe09014c2ca74da
+ .quad 0x3fe959651980da31, 0x3fe055bd6d32e8d7
+ .quad 0x3fe979d67caa6631, 0x3fe01bb2b87c6968
+ .quad 0x3fe999d4192a5715, 0x3fdfc3ee5d1524b0
+ .quad 0x3fe9b95e8fd26aba, 0x3fdf511a91a67d2a
+ .quad 0x3fe9d8768656cc42, 0x3fdedeeee0959518
+ .quad 0x3fe9f71ca72cffb6, 0x3fde6d6ffaa65a25
+ .quad 0x3fea1551a16aaeaf, 0x3fddfca26f5bbf88
+ .quad 0x3fea331628a45b92, 0x3fdd8c8aace11e63
+ .quad 0x3fea506af4cc00f4, 0x3fdd1d2cfff91594
+ .quad 0x3fea6d50c20fa293, 0x3fdcae8d93f1d7b7
+ .quad 0x3fea89c850b7d54d, 0x3fdc40b0729ed548
+ .quad 0x3feaa5d265064366, 0x3fdbd3998457afdb
+ .quad 0x3feac16fc7143263, 0x3fdb674c8ffc6283
+ .quad 0x3feadca142b10f98, 0x3fdafbcd3afe8ab6
+ .quad 0x3feaf767a741088b, 0x3fda911f096fbc26
+ .quad 0x3feb11c3c79bb424, 0x3fda27455e14c93c
+ .quad 0x3feb2bb679ead19c, 0x3fd9be437a7de946
+ .quad 0x3feb4540978921ee, 0x3fd9561c7f23a47b
+ .quad 0x3feb5e62fce16095, 0x3fd8eed36b886d93
+ .quad 0x3feb771e894d602e, 0x3fd8886b1e5ecfd1
+ .quad 0x3feb8f741ef54f83, 0x3fd822e655b417e7
+ .quad 0x3feba764a2af2b78, 0x3fd7be47af1f5d89
+ .quad 0x3febbef0fbde6221, 0x3fd75a91a7f4d2ed
+ .quad 0x3febd61a1453ab44, 0x3fd6f7c69d7d3ef8
+ .quad 0x3febece0d82d1a5c, 0x3fd695e8cd31867e
+ .quad 0x3fec034635b66e23, 0x3fd634fa54fa285f
+ .quad 0x3fec194b1d49a184, 0x3fd5d4fd33729015
+ .quad 0x3fec2ef0812fc1bd, 0x3fd575f3483021c3
+ .quad 0x3fec443755820d64, 0x3fd517de540ce2a3
+ .quad 0x3fec5920900b5fd1, 0x3fd4babff975a04c
+ .quad 0x3fec6dad2829ec62, 0x3fd45e99bcbb7915
+ .quad 0x3fec81de16b14cef, 0x3fd4036d0468a7a2
+ .quad 0x3fec95b455cce69d, 0x3fd3a93b1998736c
+ .quad 0x3feca930e0e2a825, 0x3fd35005285227f1
+ .quad 0x3fecbc54b476248d, 0x3fd2f7cc3fe6f423
+ .quad 0x3feccf20ce0c0d27, 0x3fd2a09153529381
+ .quad 0x3fece1962c0e0d8b, 0x3fd24a55399ea239
+ .quad 0x3fecf3b5cdaf0c39, 0x3fd1f518ae487dc8
+ .quad 0x3fed0580b2cfd249, 0x3fd1a0dc51a9934d
+ .quad 0x3fed16f7dbe41ca0, 0x3fd14da0a961fd14
+ .quad 0x3fed281c49d818d0, 0x3fd0fb6620c550af
+ .quad 0x3fed38eefdf64fdd, 0x3fd0aa2d09497f2b
+ .quad 0x3fed4970f9ce00d9, 0x3fd059f59af7a906
+ .quad 0x3fed59a33f19ed42, 0x3fd00abff4dec7a3
+ .quad 0x3fed6986cfa798e7, 0x3fcf79183b101c5b
+ .quad 0x3fed791cad3eff01, 0x3fcedeb406d9c825
+ .quad 0x3fed8865d98abe01, 0x3fce4652fadcb6b2
+ .quad 0x3fed97635600bb89, 0x3fcdaff4969c0b04
+ .quad 0x3feda61623cb41e0, 0x3fcd1b982c501370
+ .quad 0x3fedb47f43b2980d, 0x3fcc893ce1dcbef7
+ .quad 0x3fedc29fb60715af, 0x3fcbf8e1b1ca2279
+ .quad 0x3fedd0787a8bb39d, 0x3fcb6a856c3ed54f
+ .quad 0x3fedde0a90611a0d, 0x3fcade26b7fbed95
+ .quad 0x3fedeb56f5f12d28, 0x3fca53c4135a6526
+ .quad 0x3fedf85ea8db188e, 0x3fc9cb5bd549b111
+ .quad 0x3fee0522a5dfda73, 0x3fc944ec2e4f5630
+ .quad 0x3fee11a3e8cf4eb8, 0x3fc8c07329874652
+ .quad 0x3fee1de36c75ba58, 0x3fc83deeada4d25a
+ .quad 0x3fee29e22a89d766, 0x3fc7bd5c7df3fe9c
+ .quad 0x3fee35a11b9b61ce, 0x3fc73eba3b5b07b7
+ .quad 0x3fee4121370224cc, 0x3fc6c205655be720
+ .quad 0x3fee4c6372cd8927, 0x3fc6473b5b15a7a1
+ .quad 0x3fee5768c3b4a3fc, 0x3fc5ce595c455b0a
+ .quad 0x3fee62321d06c5e0, 0x3fc5575c8a468362
+ .quad 0x3fee6cc0709c8a0d, 0x3fc4e241e912c305
+ .quad 0x3fee7714aec96534, 0x3fc46f066040a832
+ .quad 0x3fee812fc64db369, 0x3fc3fda6bc016994
+ .quad 0x3fee8b12a44944a8, 0x3fc38e1fae1d6a9d
+ .quad 0x3fee94be342e6743, 0x3fc3206dceef5f87
+ .quad 0x3fee9e335fb56f87, 0x3fc2b48d9e5dea1c
+ .quad 0x3feea7730ed0bbb9, 0x3fc24a7b84d38971
+ .quad 0x3feeb07e27a133aa, 0x3fc1e233d434b813
+ .quad 0x3feeb9558e6b42ce, 0x3fc17bb2c8d41535
+ .quad 0x3feec1fa258c4bea, 0x3fc116f48a6476cc
+ .quad 0x3feeca6ccd709544, 0x3fc0b3f52ce8c383
+ .quad 0x3feed2ae6489ac1e, 0x3fc052b0b1a174ea
+ .quad 0x3feedabfc7453e63, 0x3fbfe6460fef4680
+ .quad 0x3feee2a1d004692c, 0x3fbf2a901ccafb37
+ .quad 0x3feeea5557137ae0, 0x3fbe723726b824a9
+ .quad 0x3feef1db32a2277c, 0x3fbdbd32ac4c99b0
+ .quad 0x3feef93436bc2daa, 0x3fbd0b7a0f921e7c
+ .quad 0x3fef006135426b26, 0x3fbc5d0497c09e74
+ .quad 0x3fef0762fde45ee6, 0x3fbbb1c972f23e50
+ .quad 0x3fef0e3a5e1a1788, 0x3fbb09bfb7d11a84
+ .quad 0x3fef14e8211e8c55, 0x3fba64de673e8837
+ .quad 0x3fef1b6d0fea5f4d, 0x3fb9c31c6df3b1b8
+ .quad 0x3fef21c9f12f0677, 0x3fb92470a61b6965
+ .quad 0x3fef27ff89525acf, 0x3fb888d1d8e510a3
+ .quad 0x3fef2e0e9a6a8b09, 0x3fb7f036c0107294
+ .quad 0x3fef33f7e43a706b, 0x3fb75a96077274ba
+ .quad 0x3fef39bc242e43e6, 0x3fb6c7e64e7281cb
+ .quad 0x3fef3f5c1558b19e, 0x3fb6381e2980956b
+ .quad 0x3fef44d870704911, 0x3fb5ab342383d178
+ .quad 0x3fef4a31ebcd47df, 0x3fb5211ebf41880b
+ .quad 0x3fef4f693b67bd77, 0x3fb499d478bca735
+ .quad 0x3fef547f10d60597, 0x3fb4154bc68d75c3
+ .quad 0x3fef59741b4b97cf, 0x3fb3937b1b31925a
+ .quad 0x3fef5e4907982a07, 0x3fb31458e6542847
+ .quad 0x3fef62fe80272419, 0x3fb297db960e4f63
+ .quad 0x3fef67952cff6282, 0x3fb21df9981f8e53
+ .quad 0x3fef6c0db3c34641, 0x3fb1a6a95b1e786f
+ .quad 0x3fef7068b7b10fd9, 0x3fb131e14fa1625d
+ .quad 0x3fef74a6d9a38383, 0x3fb0bf97e95f2a64
+ .quad 0x3fef78c8b812d498, 0x3fb04fc3a0481321
+ .quad 0x3fef7cceef15d631, 0x3fafc4b5e32d6259
+ .quad 0x3fef80ba18636f07, 0x3faeeea8c1b1db94
+ .quad 0x3fef848acb544e95, 0x3fae1d4cf1e2450a
+ .quad 0x3fef88419ce4e184, 0x3fad508f9a1ea64f
+ .quad 0x3fef8bdf1fb78370, 0x3fac885df3451a07
+ .quad 0x3fef8f63e416ebff, 0x3fabc4a54a84e834
+ .quad 0x3fef92d077f8d56d, 0x3fab055303221015
+ .quad 0x3fef96256700da8e, 0x3faa4a549829587e
+ .quad 0x3fef99633a838a57, 0x3fa993979e14fffe
+ .quad 0x3fef9c8a7989af0d, 0x3fa8e109c4622913
+ .quad 0x3fef9f9ba8d3c733, 0x3fa83298d717210e
+ .quad 0x3fefa2974addae45, 0x3fa78832c03aa2b1
+ .quad 0x3fefa57ddfe27376, 0x3fa6e1c5893c380b
+ .quad 0x3fefa84fe5e05c8d, 0x3fa63f3f5c4de13b
+ .quad 0x3fefab0dd89d1309, 0x3fa5a08e85af27e0
+ .quad 0x3fefadb831a9f9c3, 0x3fa505a174e9c929
+ .quad 0x3fefb04f6868a944, 0x3fa46e66be002240
+ .quad 0x3fefb2d3f20f9101, 0x3fa3dacd1a8d8cce
+ .quad 0x3fefb54641aebbc9, 0x3fa34ac36ad8dafe
+ .quad 0x3fefb7a6c834b5a2, 0x3fa2be38b6d92415
+ .quad 0x3fefb9f5f4739170, 0x3fa2351c2f2d1449
+ .quad 0x3fefbc3433260ca5, 0x3fa1af5d2e04f3f6
+ .quad 0x3fefbe61eef4cf6a, 0x3fa12ceb37ff9bc3
+ .quad 0x3fefc07f907bc794, 0x3fa0adb5fcfa8c75
+ .quad 0x3fefc28d7e4f9cd0, 0x3fa031ad58d56279
+ .quad 0x3fefc48c1d033c7a, 0x3f9f7182a851bca2
+ .quad 0x3fefc67bcf2d7b8f, 0x3f9e85c449e377f3
+ .quad 0x3fefc85cf56ecd38, 0x3f9da0005e5f28df
+ .quad 0x3fefca2fee770c79, 0x3f9cc0180af00a8b
+ .quad 0x3fefcbf5170b578b, 0x3f9be5ecd2fcb5f9
+ .quad 0x3fefcdacca0bfb73, 0x3f9b1160991ff737
+ .quad 0x3fefcf57607a6e7c, 0x3f9a4255a00b9f03
+ .quad 0x3fefd0f5317f582f, 0x3f9978ae8b55ce1b
+ .quad 0x3fefd2869270a56f, 0x3f98b44e6031383e
+ .quad 0x3fefd40bd6d7a785, 0x3f97f5188610ddc8
+ .quad 0x3fefd58550773cb5, 0x3f973af0c737bb45
+ .quad 0x3fefd6f34f52013a, 0x3f9685bb5134ef13
+ .quad 0x3fefd85621b0876d, 0x3f95d55cb54cd53a
+ .quad 0x3fefd9ae142795e3, 0x3f9529b9e8cf9a1e
+ .quad 0x3fefdafb719e6a69, 0x3f9482b8455dc491
+ .quad 0x3fefdc3e835500b3, 0x3f93e03d891b37de
+ .quad 0x3fefdd7790ea5bc0, 0x3f93422fd6d12e2b
+ .quad 0x3fefdea6e062d0c9, 0x3f92a875b5ffab56
+ .quad 0x3fefdfccb62e52d3, 0x3f9212f612dee7fb
+ .quad 0x3fefe0e9552ebdd6, 0x3f9181983e5133dd
+ .quad 0x3fefe1fcfebe2083, 0x3f90f443edc5ce49
+ .quad 0x3fefe307f2b503d0, 0x3f906ae13b0d3255
+ .quad 0x3fefe40a6f70af4b, 0x3f8fcab1483ea7fc
+ .quad 0x3fefe504b1d9696c, 0x3f8ec72615a894c4
+ .quad 0x3fefe5f6f568b301, 0x3f8dcaf3691fc448
+ .quad 0x3fefe6e1742f7cf6, 0x3f8cd5ec93c12432
+ .quad 0x3fefe7c466dc57a1, 0x3f8be7e5ac24963b
+ .quad 0x3fefe8a004c19ae6, 0x3f8b00b38d6b3575
+ .quad 0x3fefe97483db8670, 0x3f8a202bd6372dce
+ .quad 0x3fefea4218d6594a, 0x3f894624e78e0faf
+ .quad 0x3fefeb08f7146046, 0x3f887275e3a6869e
+ .quad 0x3fefebc950b3fa75, 0x3f87a4f6aca256cb
+ .quad 0x3fefec835695932e, 0x3f86dd7fe3358230
+ .quad 0x3fefed37386190fb, 0x3f861beae53b72b7
+ .quad 0x3fefede5248e38f4, 0x3f856011cc3b036d
+ .quad 0x3fefee8d486585ee, 0x3f84a9cf6bda3f4c
+ .quad 0x3fefef2fd00af31a, 0x3f83f8ff5042a88e
+ .quad 0x3fefefcce6813974, 0x3f834d7dbc76d7e5
+ .quad 0x3feff064b5afffbe, 0x3f82a727a89a3f14
+ .quad 0x3feff0f766697c76, 0x3f8205dac02bd6b9
+ .quad 0x3feff18520700971, 0x3f81697560347b26
+ .quad 0x3feff20e0a7ba8c2, 0x3f80d1d69569b82d
+ .quad 0x3feff2924a3f7a83, 0x3f803ede1a45bfee
+ .quad 0x3feff312046f2339, 0x3f7f60d8aa2a88f2
+ .quad 0x3feff38d5cc4227f, 0x3f7e4cc4abf7d065
+ .quad 0x3feff404760319b4, 0x3f7d4143a9dfe965
+ .quad 0x3feff47772010262, 0x3f7c3e1a5f5c077c
+ .quad 0x3feff4e671a85425, 0x3f7b430ecf4a83a8
+ .quad 0x3feff55194fe19df, 0x3f7a4fe83fb9db25
+ .quad 0x3feff5b8fb26f5f6, 0x3f79646f35a76624
+ .quad 0x3feff61cc26c1578, 0x3f78806d70b2fc36
+ .quad 0x3feff67d08401202, 0x3f77a3ade6c8b3e5
+ .quad 0x3feff6d9e943c231, 0x3f76cdfcbfc1e263
+ .quad 0x3feff733814af88c, 0x3f75ff2750fe7820
+ .quad 0x3feff789eb6130c9, 0x3f7536fc18f7ce5c
+ .quad 0x3feff7dd41ce2b4d, 0x3f74754abacdf1dc
+ .quad 0x3feff82d9e1a76d8, 0x3f73b9e3f9d06e3f
+ .quad 0x3feff87b1913e853, 0x3f730499b503957f
+ .quad 0x3feff8c5cad200a5, 0x3f72553ee2a336bf
+ .quad 0x3feff90dcaba4096, 0x3f71aba78ba3af89
+ .quad 0x3feff9532f846ab0, 0x3f7107a8c7323a6e
+ .quad 0x3feff9960f3eb327, 0x3f706918b6355624
+ .quad 0x3feff9d67f51ddba, 0x3f6f9f9cfd9c3035
+ .quad 0x3feffa14948549a7, 0x3f6e77448fb66bb9
+ .quad 0x3feffa506302ebae, 0x3f6d58da68fd1170
+ .quad 0x3feffa89fe5b3625, 0x3f6c4412bf4b8f0b
+ .quad 0x3feffac17988ef4b, 0x3f6b38a3af2e55b4
+ .quad 0x3feffaf6e6f4f5c0, 0x3f6a3645330550ff
+ .quad 0x3feffb2a5879f35e, 0x3f693cb11a30d765
+ .quad 0x3feffb5bdf67fe6f, 0x3f684ba3004a50d0
+ .quad 0x3feffb8b8c88295f, 0x3f6762d84469c18f
+ .quad 0x3feffbb970200110, 0x3f66821000795a03
+ .quad 0x3feffbe599f4f9d9, 0x3f65a90b00981d93
+ .quad 0x3feffc10194fcb64, 0x3f64d78bba8ca5fd
+ .quad 0x3feffc38fcffbb7c, 0x3f640d564548fad7
+ .quad 0x3feffc60535dd7f5, 0x3f634a305080681f
+ .quad 0x3feffc862a501fd7, 0x3f628de11c5031eb
+ .quad 0x3feffcaa8f4c9bea, 0x3f61d83170fbf6fb
+ .quad 0x3feffccd8f5c66d1, 0x3f6128eb96be8798
+ .quad 0x3feffcef371ea4d7, 0x3f607fdb4dafea5f
+ .quad 0x3feffd0f92cb6ba7,
0x3f5fb99b8b8279e1 + .quad 0x3feffd2eae369a07, 0x3f5e7f232d9e2630 + .quad 0x3feffd4c94d29fdb, 0x3f5d4fed7195d7e8 + .quad 0x3feffd6951b33686, 0x3f5c2b9cf7f893bf + .quad 0x3feffd84ef9009ee, 0x3f5b11d702b3deb2 + .quad 0x3feffd9f78c7524a, 0x3f5a024365f771bd + .quad 0x3feffdb8f7605ee7, 0x3f58fc8c794b03b5 + .quad 0x3feffdd1750e1220, 0x3f58005f08d6f1ef + .quad 0x3feffde8fb314ebf, 0x3f570d6a46e07dda + .quad 0x3feffdff92db56e5, 0x3f56235fbd7a4345 + .quad 0x3feffe1544d01ccb, 0x3f5541f340697987 + .quad 0x3feffe2a1988857c, 0x3f5468dadf4080ab + .quad 0x3feffe3e19349dc7, 0x3f5397ced7af2b15 + .quad 0x3feffe514bbdc197, 0x3f52ce898809244e + .quad 0x3feffe63b8c8b5f7, 0x3f520cc76202c5fb + .quad 0x3feffe7567b7b5e1, 0x3f515246dda49d47 + .quad 0x3feffe865fac722b, 0x3f509ec86c75d497 + .quad 0x3feffe96a78a04a9, 0x3f4fe41cd9bb4eee + .quad 0x3feffea645f6d6da, 0x3f4e97ba3b77f306 + .quad 0x3feffeb5415e7c44, 0x3f4d57f524723822 + .quad 0x3feffec39ff380b9, 0x3f4c245d4b99847a + .quad 0x3feffed167b12ac2, 0x3f4afc85e0f82e12 + .quad 0x3feffede9e5d3262, 0x3f49e005769dbc1d + .quad 0x3feffeeb49896c6d, 0x3f48ce75e9f6f8a0 + .quad 0x3feffef76e956a9f, 0x3f47c7744d9378f7 + .quad 0x3fefff0312b010b5, 0x3f46caa0d3582fe9 + .quad 0x3fefff0e3ad91ec2, 0x3f45d79eb71e893b + .quad 0x3fefff18ebe2b0e1, 0x3f44ee1429bf7cc0 + .quad 0x3fefff232a72b48e, 0x3f440daa3c89f5b6 + .quad 0x3fefff2cfb0453d9, 0x3f43360ccd23db3a + .quad 0x3fefff3661e9569d, 0x3f4266ea71d4f71a + .quad 0x3fefff3f634b79f9, 0x3f419ff4663ae9df + .quad 0x3fefff48032dbe40, 0x3f40e0de78654d1e + .quad 0x3fefff50456dab8c, 0x3f40295ef6591848 + .quad 0x3fefff582dc48d30, 0x3f3ef25d37f49fe1 + .quad 0x3fefff5fbfc8a439, 0x3f3da01102b5f851 + .quad 0x3fefff66feee5129, 0x3f3c5b5412dcafad + .quad 0x3fefff6dee89352e, 0x3f3b23a5a23e4210 + .quad 0x3fefff7491cd4af6, 0x3f39f8893d8fd1c1 + .quad 0x3fefff7aebcff755, 0x3f38d986a4187285 + .quad 0x3fefff80ff8911fd, 0x3f37c629a822bc9e + .quad 0x3fefff86cfd3e657, 0x3f36be02102b3520 + .quad 0x3fefff8c5f702ccf, 0x3f35c0a378c90bca + 
.quad 0x3fefff91b102fca8, 0x3f34cda5374ea275 + .quad 0x3fefff96c717b695, 0x3f33e4a23d1f4703 + .quad 0x3fefff9ba420e834, 0x3f330538fbb77ecd + .quad 0x3fefffa04a7928b1, 0x3f322f0b496539be + .quad 0x3fefffa4bc63ee9a, 0x3f3161be46ad3b50 + .quad 0x3fefffa8fc0e5f33, 0x3f309cfa445b00ff + .quad 0x3fefffad0b901755, 0x3f2fc0d55470cf51 + .quad 0x3fefffb0ecebee1b, 0x3f2e577bbcd49935 + .quad 0x3fefffb4a210b172, 0x3f2cfd4a5adec5c0 + .quad 0x3fefffb82cd9dcbf, 0x3f2bb1a9657ce465 + .quad 0x3fefffbb8f1049c6, 0x3f2a740684026555 + .quad 0x3fefffbeca6adbe9, 0x3f2943d4a1d1ed39 + .quad 0x3fefffc1e08f25f5, 0x3f28208bc334a6a5 + .quad 0x3fefffc4d3120aa1, 0x3f2709a8db59f25c + .quad 0x3fefffc7a37857d2, 0x3f25feada379d8b7 + .quad 0x3fefffca53375ce3, 0x3f24ff207314a102 + .quad 0x3fefffcce3b57bff, 0x3f240a8c1949f75e + .quad 0x3fefffcf564ab6b7, 0x3f23207fb7420eb9 + .quad 0x3fefffd1ac4135f9, 0x3f22408e9ba3327f + .quad 0x3fefffd3e6d5cd87, 0x3f216a501f0e42ca + .quad 0x3fefffd607387b07, 0x3f209d5f819c9e29 + .quad 0x3fefffd80e8ce0da, 0x3f1fb2b792b40a22 + .quad 0x3fefffd9fdeabcce, 0x3f1e3bcf436a1a95 + .quad 0x3fefffdbd65e5ad0, 0x3f1cd55277c18d05 + .quad 0x3fefffdd98e903b2, 0x3f1b7e94604479dc + .quad 0x3fefffdf46816833, 0x3f1a36eec00926dd + .quad 0x3fefffe0e0140857, 0x3f18fdc1b2dcf7b9 + .quad 0x3fefffe26683972a, 0x3f17d2737527c3f9 + .quad 0x3fefffe3daa95b18, 0x3f16b4702d7d5849 + .quad 0x3fefffe53d558ae9, 0x3f15a329b7d30748 + .quad 0x3fefffe68f4fa777, 0x3f149e17724f4d41 + .quad 0x3fefffe7d156d244, 0x3f13a4b60ba9aa4e + .quad 0x3fefffe904222101, 0x3f12b6875310f785 + .quad 0x3fefffea2860ee1e, 0x3f11d312098e9dba + .quad 0x3fefffeb3ebb267b, 0x3f10f9e1b4dd36df + .quad 0x3fefffec47d19457, 0x3f102a8673a94692 + .quad 0x3fefffed443e2787, 0x3f0ec929a665b449 + .quad 0x3fefffee34943b15, 0x3f0d4f4b4c8e09ed + .quad 0x3fefffef1960d85d, 0x3f0be6abbb10a5aa + .quad 0x3fefffeff32af7af, 0x3f0a8e8cc1fadef6 + .quad 0x3feffff0c273bea2, 0x3f094637d5bacfdb + .quad 0x3feffff187b6bc0e, 0x3f080cfdc72220cf + .quad 0x3feffff2436a21dc, 
0x3f06e2367dc27f95 + .quad 0x3feffff2f5fefcaa, 0x3f05c540b4936fd2 + .quad 0x3feffff39fe16963, 0x3f04b581b8d170fc + .quad 0x3feffff44178c8d2, 0x3f03b2652b06c2b2 + .quad 0x3feffff4db27f146, 0x3f02bb5cc22e5db6 + .quad 0x3feffff56d4d5e5e, 0x3f01cfe010e2052d + .quad 0x3feffff5f8435efc, 0x3f00ef6c4c84a0fe + .quad 0x3feffff67c604180, 0x3f001984165a5f36 + .quad 0x3feffff6f9f67e55, 0x3efe9b5e8d00ce77 + .quad 0x3feffff77154e0d6, 0x3efd16f5716c6c1a + .quad 0x3feffff7e2c6aea2, 0x3efba4f035d60e03 + .quad 0x3feffff84e93cd75, 0x3efa447b7b03f045 + .quad 0x3feffff8b500e77c, 0x3ef8f4ccca7fc90d + .quad 0x3feffff9164f8e46, 0x3ef7b5223dac7336 + .quad 0x3feffff972be5c59, 0x3ef684c227fcacef + .quad 0x3feffff9ca891572, 0x3ef562fac4329b48 + .quad 0x3feffffa1de8c582, 0x3ef44f21e49054f2 + .quad 0x3feffffa6d13de73, 0x3ef34894a5e24657 + .quad 0x3feffffab83e54b8, 0x3ef24eb7254ccf83 + .quad 0x3feffffaff99bac4, 0x3ef160f438c70913 + .quad 0x3feffffb43555b5f, 0x3ef07ebd2a2d2844 + .quad 0x3feffffb839e52f3, 0x3eef4f12e9ab070a + .quad 0x3feffffbc09fa7cd, 0x3eedb5ad0b27805c + .quad 0x3feffffbfa82616b, 0x3eec304efa2c6f4e + .quad 0x3feffffc316d9ed0, 0x3eeabe09e9144b5e + .quad 0x3feffffc6586abf6, 0x3ee95df988e76644 + .quad 0x3feffffc96f1165e, 0x3ee80f439b4ee04b + .quad 0x3feffffcc5cec0c1, 0x3ee6d11788a69c64 + .quad 0x3feffffcf23ff5fc, 0x3ee5a2adfa0b4bc4 + .quad 0x3feffffd1c637b2b, 0x3ee4834877429b8f + .quad 0x3feffffd4456a10d, 0x3ee37231085c7d9a + .quad 0x3feffffd6a3554a1, 0x3ee26eb9daed6f7e + .quad 0x3feffffd8e1a2f22, 0x3ee1783ceac28910 + .quad 0x3feffffdb01e8546, 0x3ee08e1badf0fced + .quad 0x3feffffdd05a75ea, 0x3edf5f7d88472604 + .quad 0x3feffffdeee4f810, 0x3eddb92b5212fb8d + .quad 0x3feffffe0bd3e852, 0x3edc282cd3957eda + .quad 0x3feffffe273c15b7, 0x3edaab7abace48dc + .quad 0x3feffffe41314e06, 0x3ed94219bfcb4928 + .quad 0x3feffffe59c6698b, 0x3ed7eb1a2075864e + .quad 0x3feffffe710d565e, 0x3ed6a597219a93da + .quad 0x3feffffe8717232d, 0x3ed570b69502f313 + .quad 0x3feffffe9bf4098c, 0x3ed44ba864670882 + 
.quad 0x3feffffeafb377d5, 0x3ed335a62115bce2 + .quad 0x3feffffec2641a9e, 0x3ed22df298214423 + .quad 0x3feffffed413e5b7, 0x3ed133d96ae7e0dd + .quad 0x3feffffee4d01cd6, 0x3ed046aeabcfcdec + .quad 0x3feffffef4a55bd4, 0x3ececb9cfe1d8642 + .quad 0x3fefffff039f9e8f, 0x3ecd21397ead99cb + .quad 0x3fefffff11ca4876, 0x3ecb8d094c86d374 + .quad 0x3fefffff1f302bc1, 0x3eca0df0f0c626dc + .quad 0x3fefffff2bdb904d, 0x3ec8a2e269750a39 + .quad 0x3fefffff37d63a36, 0x3ec74adc8f4064d3 + .quad 0x3fefffff43297019, 0x3ec604ea819f007c + .quad 0x3fefffff4dde0118, 0x3ec4d0231928c6f9 + .quad 0x3fefffff57fc4a95, 0x3ec3aba85fe22e20 + .quad 0x3fefffff618c3da6, 0x3ec296a70f414053 + .quad 0x3fefffff6a956450, 0x3ec1905613b3abf2 + .quad 0x3fefffff731ee681, 0x3ec097f6156f32c5 + .quad 0x3fefffff7b2f8ed6, 0x3ebf59a20caf6695 + .quad 0x3fefffff82cdcf1b, 0x3ebd9c73698fb1dc + .quad 0x3fefffff89ffc4aa, 0x3ebbf716c6168bae + .quad 0x3fefffff90cb3c81, 0x3eba6852c6b58392 + .quad 0x3fefffff9735b73b, 0x3eb8eefd70594a89 + .quad 0x3fefffff9d446ccc, 0x3eb789fb715aae95 + .quad 0x3fefffffa2fc5015, 0x3eb6383f726a8e04 + .quad 0x3fefffffa8621251, 0x3eb4f8c96f26a26a + .quad 0x3fefffffad7a2652, 0x3eb3caa61607f920 + .quad 0x3fefffffb248c39d, 0x3eb2acee2f5ecdb8 + .quad 0x3fefffffb6d1e95d, 0x3eb19ec60b1242ed + .quad 0x3fefffffbb196132, 0x3eb09f5cf4dd2877 + .quad 0x3fefffffbf22c1e2, 0x3eaf5bd95d8730d8 + .quad 0x3fefffffc2f171e3, 0x3ead9371e2ff7c35 + .quad 0x3fefffffc688a9cf, 0x3eabe41de54d155a + .quad 0x3fefffffc9eb76ac, 0x3eaa4c89e08ef4f3 + .quad 0x3fefffffcd1cbc28, 0x3ea8cb738399b12c + .quad 0x3fefffffd01f36af, 0x3ea75fa8dbc84bec + .quad 0x3fefffffd2f57d68, 0x3ea608078a70dcbc + .quad 0x3fefffffd5a2041f, 0x3ea4c37c0394d094 + .quad 0x3fefffffd8271d12, 0x3ea39100d5687bfe + .quad 0x3fefffffda86faa9, 0x3ea26f9df8519bd7 + .quad 0x3fefffffdcc3b117, 0x3ea15e6827001f18 + .quad 0x3fefffffdedf37ed, 0x3ea05c803e4831c1 + .quad 0x3fefffffe0db6b91, 0x3e9ed22548cffd35 + .quad 0x3fefffffe2ba0ea5, 0x3e9d06ad6ecdf971 + .quad 0x3fefffffe47ccb60, 
0x3e9b551c847fbc96 + .quad 0x3fefffffe62534d4, 0x3e99bc09f112b494 + .quad 0x3fefffffe7b4c81e, 0x3e983a1ff0aa239d + .quad 0x3fefffffe92ced93, 0x3e96ce1aa3fd7bdd + .quad 0x3fefffffea8ef9cf, 0x3e9576c72b514859 + .quad 0x3fefffffebdc2ec6, 0x3e943302cc4a0da8 + .quad 0x3fefffffed15bcba, 0x3e9301ba221dc9bb + .quad 0x3fefffffee3cc32c, 0x3e91e1e857adc568 + .quad 0x3fefffffef5251c2, 0x3e90d2966b1746f7 + .quad 0x3feffffff0576917, 0x3e8fa5b4f49cc6b2 + .quad 0x3feffffff14cfb92, 0x3e8dc3ae30b55c16 + .quad 0x3feffffff233ee1d, 0x3e8bfd7555a3bd68 + .quad 0x3feffffff30d18e8, 0x3e8a517d9e61628a + .quad 0x3feffffff3d9480f, 0x3e88be4f8f6c951f + .quad 0x3feffffff4993c46, 0x3e874287ded49339 + .quad 0x3feffffff54dab72, 0x3e85dcd669f2cd34 + .quad 0x3feffffff5f74141, 0x3e848bfd38302871 + .quad 0x3feffffff6969fb8, 0x3e834ecf8a3c124a + .quad 0x3feffffff72c5fb6, 0x3e822430f521cbcf + .quad 0x3feffffff7b91176, 0x3e810b1488aeb235 + .quad 0x3feffffff83d3d07, 0x3e80027c00a263a6 + .quad 0x3feffffff8b962be, 0x3e7e12ee004efc37 + .quad 0x3feffffff92dfba2, 0x3e7c3e44ae32b16b + .quad 0x3feffffff99b79d2, 0x3e7a854ea14102a8 + .quad 0x3feffffffa0248e8, 0x3e78e6761569f45d + .quad 0x3feffffffa62ce54, 0x3e77603bac345f65 + .quad 0x3feffffffabd69b4, 0x3e75f1353cdad001 + .quad 0x3feffffffb127525, 0x3e74980cb3c80949 + .quad 0x3feffffffb624592, 0x3e73537f00b6ad4d + .quad 0x3feffffffbad2aff, 0x3e72225b12bffc68 + .quad 0x3feffffffbf370cd, 0x3e710380e1adb7e9 + .quad 0x3feffffffc355dfd, 0x3e6febc107d5efaa + .quad 0x3feffffffc733572, 0x3e6df0f2a0ee6947 + .quad 0x3feffffffcad3626, 0x3e6c14b2188bcee4 + .quad 0x3feffffffce39b67, 0x3e6a553644f7f07d + .quad 0x3feffffffd169d0c, 0x3e68b0cfce0579e0 + .quad 0x3feffffffd466fa5, 0x3e6725e7c5dd20f7 + .quad 0x3feffffffd7344aa, 0x3e65b2fe547a1340 + .quad 0x3feffffffd9d4aab, 0x3e6456a974e92e93 + .quad 0x3feffffffdc4ad7a, 0x3e630f93c3699078 + .quad 0x3feffffffde9964e, 0x3e61dc7b5b978cf8 + .quad 0x3feffffffe0c2bf0, 0x3e60bc30c5d52f15 + .quad 0x3feffffffe2c92db, 0x3e5f5b2be65a0c7f + 
.quad 0x3feffffffe4aed5e, 0x3e5d5f3a8dea7357 + .quad 0x3feffffffe675bbd, 0x3e5b82915b03515b + .quad 0x3feffffffe81fc4e, 0x3e59c3517e789488 + .quad 0x3feffffffe9aeb97, 0x3e581fb7df06136e + .quad 0x3feffffffeb24467, 0x3e56961b8d641d06 + .quad 0x3feffffffec81ff2, 0x3e5524ec4d916cae + .quad 0x3feffffffedc95e7, 0x3e53cab1343d18d1 + .quad 0x3feffffffeefbc85, 0x3e52860757487a01 + .quad 0x3fefffffff01a8b6, 0x3e5155a09065d4f7 + .quad 0x3fefffffff126e1e, 0x3e50384250e4c9fc + .quad 0x3fefffffff221f30, 0x3e4e59890b926c78 + .quad 0x3fefffffff30cd3f, 0x3e4c642116a8a9e3 + .quad 0x3fefffffff3e8892, 0x3e4a8e405e651ab6 + .quad 0x3fefffffff4b606f, 0x3e48d5f98114f872 + .quad 0x3fefffffff57632d, 0x3e47397c5a66e307 + .quad 0x3fefffffff629e44, 0x3e45b71456c5a4c4 + .quad 0x3fefffffff6d1e56, 0x3e444d26de513197 + .quad 0x3fefffffff76ef3f, 0x3e42fa31d6371537 + .quad 0x3fefffffff801c1f, 0x3e41bcca373b7b43 + .quad 0x3fefffffff88af67, 0x3e40939ab853339f + .quad 0x3fefffffff90b2e3, 0x3e3efac5187b2863 + .quad 0x3fefffffff982fc1, 0x3e3cf1e86235d0e7 + .quad 0x3fefffffff9f2e9f, 0x3e3b0a68a2128bab + .quad 0x3fefffffffa5b790, 0x3e39423165bc4444 + .quad 0x3fefffffffabd229, 0x3e37974e743dea3d + .quad 0x3fefffffffb18582, 0x3e3607e9eacd1050 + .quad 0x3fefffffffb6d844, 0x3e34924a74dec729 + .quad 0x3fefffffffbbd0aa, 0x3e3334d19e0c2160 + .quad 0x3fefffffffc0748f, 0x3e31edfa3c5f5cca + .quad 0x3fefffffffc4c96c, 0x3e30bc56f1b54701 + .quad 0x3fefffffffc8d462, 0x3e2f3d2185e047d9 + .quad 0x3fefffffffcc9a41, 0x3e2d26cb87945e87 + .quad 0x3fefffffffd01f89, 0x3e2b334fac4b9f99 + .quad 0x3fefffffffd36871, 0x3e296076f7918d1c + .quad 0x3fefffffffd678ed, 0x3e27ac2d72fc2c63 + .quad 0x3fefffffffd954ae, 0x3e2614801550319e + .quad 0x3fefffffffdbff2a, 0x3e24979ac8b28927 + .quad 0x3fefffffffde7ba0, 0x3e2333c68e2d0548 + .quad 0x3fefffffffe0cd16, 0x3e21e767bce37dd7 + .quad 0x3fefffffffe2f664, 0x3e20b0fc5b6d05a0 + .quad 0x3fefffffffe4fa30, 0x3e1f1e3523b41d7d + .quad 0x3fefffffffe6daf7, 0x3e1d00de6608effe + .quad 0x3fefffffffe89b0c, 
0x3e1b0778b7b3301b + .quad 0x3fefffffffea3c9a, 0x3e192fb04ec0f6cf + .quad 0x3fefffffffebc1a9, 0x3e177756ec9f78fa + .quad 0x3fefffffffed2c21, 0x3e15dc61922d5a06 + .quad 0x3fefffffffee7dc8, 0x3e145ce65699ff6d + .quad 0x3fefffffffefb847, 0x3e12f71a5f159970 + .quad 0x3feffffffff0dd2b, 0x3e11a94ff571654f + .quad 0x3feffffffff1ede9, 0x3e1071f4bbea09ec + .quad 0x3feffffffff2ebda, 0x3e0e9f1ff8ddd774 + .quad 0x3feffffffff3d843, 0x3e0c818223a202c7 + .quad 0x3feffffffff4b453, 0x3e0a887bd2b4404d + .quad 0x3feffffffff58126, 0x3e08b1a336c5eb6b + .quad 0x3feffffffff63fc3, 0x3e06fab63324088a + .quad 0x3feffffffff6f121, 0x3e056197e30205ba + .quad 0x3feffffffff79626, 0x3e03e44e45301b92 + .quad 0x3feffffffff82fab, 0x3e0281000bfe4c3f + .quad 0x3feffffffff8be77, 0x3e0135f28f2d50b4 + .quad 0x3feffffffff94346, 0x3e000187dded5975 + .quad 0x3feffffffff9bec8, 0x3dfdc479de0ef001 + .quad 0x3feffffffffa319f, 0x3dfbad4fdad3caa1 + .quad 0x3feffffffffa9c63, 0x3df9baed3ed27ab8 + .quad 0x3feffffffffaffa4, 0x3df7ead9ce4285bb + .quad 0x3feffffffffb5be5, 0x3df63ac6b4edc88e + .quad 0x3feffffffffbb1a2, 0x3df4a88be2a6390c + .quad 0x3feffffffffc014e, 0x3df332259185f1a0 + .quad 0x3feffffffffc4b56, 0x3df1d5b1f3793044 + .quad 0x3feffffffffc901c, 0x3df0916f04b6e18b + .quad 0x3feffffffffccfff, 0x3deec77101de6926 + .quad 0x3feffffffffd0b56, 0x3dec960bf23153e0 + .quad 0x3feffffffffd4271, 0x3dea8bd20fc65ef7 + .quad 0x3feffffffffd759d, 0x3de8a61745ec7d1d + .quad 0x3feffffffffda520, 0x3de6e25d0e756261 + .quad 0x3feffffffffdd13c, 0x3de53e4f7d1666cb + .quad 0x3feffffffffdfa2d, 0x3de3b7c27a7ddb0e + .quad 0x3feffffffffe202d, 0x3de24caf2c32af14 + .quad 0x3feffffffffe4371, 0x3de0fb3186804d0f + .quad 0x3feffffffffe642a, 0x3ddf830c0bb41fd7 + .quad 0x3feffffffffe8286, 0x3ddd3c0f1a91c846 + .quad 0x3feffffffffe9eb0, 0x3ddb1e5acf351d87 + .quad 0x3feffffffffeb8d0, 0x3dd92712d259ce66 + .quad 0x3feffffffffed10a, 0x3dd7538c60a04476 + .quad 0x3feffffffffee782, 0x3dd5a14b04b47879 + .quad 0x3feffffffffefc57, 0x3dd40dfd87456f4c + 
.quad 0x3fefffffffff0fa7, 0x3dd2977b1172b9d5 + .quad 0x3fefffffffff218f, 0x3dd13bc07e891491 + .quad 0x3fefffffffff3227, 0x3dcff1dbb4300811 + .quad 0x3fefffffffff4188, 0x3dcd9a880f306bd8 + .quad 0x3fefffffffff4fc9, 0x3dcb6e45220b55e0 + .quad 0x3fefffffffff5cfd, 0x3dc96a0b33f2c4da + .quad 0x3fefffffffff6939, 0x3dc78b07e9e924ac + .quad 0x3fefffffffff748e, 0x3dc5ce9ab1670dd2 + .quad 0x3fefffffffff7f0d, 0x3dc4325167006bb0 + .quad 0x3fefffffffff88c5, 0x3dc2b3e53538ff3f + .quad 0x3fefffffffff91c6, 0x3dc15137a7f44864 + .quad 0x3fefffffffff9a1b, 0x3dc0084ff125639d + .quad 0x3fefffffffffa1d2, 0x3dbdaeb0b7311ec7 + .quad 0x3fefffffffffa8f6, 0x3dbb7937d1c40c53 + .quad 0x3fefffffffffaf92, 0x3db96d082f59ab06 + .quad 0x3fefffffffffb5b0, 0x3db7872d9fa10aad + .quad 0x3fefffffffffbb58, 0x3db5c4e8e37bc7d0 + .quad 0x3fefffffffffc095, 0x3db423ac0df49a40 + .quad 0x3fefffffffffc56d, 0x3db2a117230ad284 + .quad 0x3fefffffffffc9e8, 0x3db13af4f04f9998 + .quad 0x3fefffffffffce0d, 0x3dafde703724e560 + .quad 0x3fefffffffffd1e1, 0x3dad77f0c82e7641 + .quad 0x3fefffffffffd56c, 0x3dab3ee02611d7dd + .quad 0x3fefffffffffd8b3, 0x3da92ff33023d5bd + .quad 0x3fefffffffffdbba, 0x3da7481a9e69f53f + .quad 0x3fefffffffffde86, 0x3da5847eda620959 + .quad 0x3fefffffffffe11d, 0x3da3e27c1fcc74bd + .quad 0x3fefffffffffe380, 0x3da25f9ee0b923dc + .quad 0x3fefffffffffe5b6, 0x3da0f9a068653200 + .quad 0x3fefffffffffe7c0, 0x3d9f5cc7718082b0 + .quad 0x3fefffffffffe9a2, 0x3d9cf7e53d6a2ca5 + .quad 0x3fefffffffffeb60, 0x3d9ac0f5f3229372 + .quad 0x3fefffffffffecfb, 0x3d98b498644847ea + .quad 0x3fefffffffffee77, 0x3d96cfa9bcca59dc + .quad 0x3fefffffffffefd6, 0x3d950f411d4fd2cd + .quad 0x3feffffffffff11a, 0x3d9370ab8327af5e + .quad 0x3feffffffffff245, 0x3d91f167f88c6b6e + .quad 0x3feffffffffff359, 0x3d908f24085d4597 + .quad 0x3feffffffffff457, 0x3d8e8f70e181d61a + .quad 0x3feffffffffff542, 0x3d8c324c20e337dc + .quad 0x3feffffffffff61b, 0x3d8a03261574b54e + .quad 0x3feffffffffff6e3, 0x3d87fe903cdf5855 + .quad 0x3feffffffffff79b, 
0x3d86215c58da3450 + .quad 0x3feffffffffff845, 0x3d846897d4b69fc6 + .quad 0x3feffffffffff8e2, 0x3d82d1877d731b7b + .quad 0x3feffffffffff973, 0x3d8159a386b11517 + .quad 0x3feffffffffff9f8, 0x3d7ffd27ae9393ce + .quad 0x3feffffffffffa73, 0x3d7d7c593130dd0b + .quad 0x3feffffffffffae4, 0x3d7b2cd607c79bcf + .quad 0x3feffffffffffb4c, 0x3d790ae4d3405651 + .quad 0x3feffffffffffbad, 0x3d771312dd1759e2 + .quad 0x3feffffffffffc05, 0x3d75422ef5d8949d + .quad 0x3feffffffffffc57, 0x3d739544b0ecc957 + .quad 0x3feffffffffffca2, 0x3d720997f73e73dd + .quad 0x3feffffffffffce7, 0x3d709ca0eaacd277 + .quad 0x3feffffffffffd27, 0x3d6e9810295890ec + .quad 0x3feffffffffffd62, 0x3d6c2b45b5aa4a1d + .quad 0x3feffffffffffd98, 0x3d69eee068fa7596 + .quad 0x3feffffffffffdca, 0x3d67df2b399c10a8 + .quad 0x3feffffffffffdf8, 0x3d65f8b87a31bd85 + .quad 0x3feffffffffffe22, 0x3d64385c96e9a2d9 + .quad 0x3feffffffffffe49, 0x3d629b2933ef4cbc + .quad 0x3feffffffffffe6c, 0x3d611e68a6378f8a + .quad 0x3feffffffffffe8d, 0x3d5f7f338086a86b + .quad 0x3feffffffffffeab, 0x3d5cf8d7d9ce040a + .quad 0x3feffffffffffec7, 0x3d5aa577251ae485 + .quad 0x3feffffffffffee1, 0x3d58811d739efb5f + .quad 0x3feffffffffffef8, 0x3d568823e52970be + .quad 0x3fefffffffffff0e, 0x3d54b72ae68e8b4c + .quad 0x3fefffffffffff22, 0x3d530b14dbe876bc + .quad 0x3fefffffffffff34, 0x3d5181012ef86610 + .quad 0x3fefffffffffff45, 0x3d501647ba798745 + .quad 0x3fefffffffffff54, 0x3d4d90e917701675 + .quad 0x3fefffffffffff62, 0x3d4b2a87e86d0c8a + .quad 0x3fefffffffffff6f, 0x3d48f53dcb377293 + .quad 0x3fefffffffffff7b, 0x3d46ed2f2515e933 + .quad 0x3fefffffffffff86, 0x3d450ecc9ed47f19 + .quad 0x3fefffffffffff90, 0x3d4356cd5ce7799e + .quad 0x3fefffffffffff9a, 0x3d41c229a587ab78 + .quad 0x3fefffffffffffa2, 0x3d404e15ecc7f3f6 + .quad 0x3fefffffffffffaa, 0x3d3deffc7e6a6017 + .quad 0x3fefffffffffffb1, 0x3d3b7b040832f310 + .quad 0x3fefffffffffffb8, 0x3d3938e021f36d76 + .quad 0x3fefffffffffffbe, 0x3d37258610b3b233 + .quad 0x3fefffffffffffc3, 0x3d353d3bfc82a909 + 
.quad 0x3fefffffffffffc8, 0x3d337c92babdc2fd + .quad 0x3fefffffffffffcd, 0x3d31e06010120f6a + .quad 0x3fefffffffffffd1, 0x3d3065b9616170d4 + .quad 0x3fefffffffffffd5, 0x3d2e13dd96b3753b + .quad 0x3fefffffffffffd9, 0x3d2b950d32467392 + .quad 0x3fefffffffffffdc, 0x3d294a72263259a5 + .quad 0x3fefffffffffffdf, 0x3d272fd93e036cdc + .quad 0x3fefffffffffffe2, 0x3d254164576929ab + .quad 0x3fefffffffffffe4, 0x3d237b83c521fe96 + .quad 0x3fefffffffffffe7, 0x3d21daf033182e96 + .quad 0x3fefffffffffffe9, 0x3d205ca50205d26a + .quad 0x3fefffffffffffeb, 0x3d1dfbb6235639fa + .quad 0x3fefffffffffffed, 0x3d1b7807e294781f + .quad 0x3fefffffffffffee, 0x3d19298add70a734 + .quad 0x3feffffffffffff0, 0x3d170beaf9c7ffb6 + .quad 0x3feffffffffffff1, 0x3d151b2cd6709222 + .quad 0x3feffffffffffff3, 0x3d1353a6cf7f7fff + .quad 0x3feffffffffffff4, 0x3d11b1fa8cbe84a7 + .quad 0x3feffffffffffff5, 0x3d10330f0fd69921 + .quad 0x3feffffffffffff6, 0x3d0da81670f96f9b + .quad 0x3feffffffffffff7, 0x3d0b24a16b4d09aa + .quad 0x3feffffffffffff7, 0x3d08d6eeb6efdbd6 + .quad 0x3feffffffffffff8, 0x3d06ba91ac734786 + .quad 0x3feffffffffffff9, 0x3d04cb7966770ab5 + .quad 0x3feffffffffffff9, 0x3d0305e9721d0981 + .quad 0x3feffffffffffffa, 0x3d01667311fff70a + .quad 0x3feffffffffffffb, 0x3cffd3de10d62855 + .quad 0x3feffffffffffffb, 0x3cfd1aefbcd48d0c + .quad 0x3feffffffffffffb, 0x3cfa9cc93c25aca9 + .quad 0x3feffffffffffffc, 0x3cf85487ee3ea735 + .quad 0x3feffffffffffffc, 0x3cf63daf8b4b1e0c + .quad 0x3feffffffffffffd, 0x3cf45421e69a6ca1 + .quad 0x3feffffffffffffd, 0x3cf294175802d99a + .quad 0x3feffffffffffffd, 0x3cf0fa17bf41068f + .quad 0x3feffffffffffffd, 0x3cef05e82aae2bb9 + .quad 0x3feffffffffffffe, 0x3cec578101b29058 + .quad 0x3feffffffffffffe, 0x3ce9e39dc5dd2f7c + .quad 0x3feffffffffffffe, 0x3ce7a553a728bbf2 + .quad 0x3feffffffffffffe, 0x3ce5982008db1304 + .quad 0x3feffffffffffffe, 0x3ce3b7e00422e51b + .quad 0x3feffffffffffffe, 0x3ce200c898d9ee3e + .quad 0x3fefffffffffffff, 0x3ce06f5f7eb65a56 + .quad 0x3fefffffffffffff, 
0x3cde00e9148a1d25
+	.quad 0x3fefffffffffffff, 0x3cdb623734024e92
+	.quad 0x3fefffffffffffff, 0x3cd8fd4e01891bf8
+	.quad 0x3fefffffffffffff, 0x3cd6cd44c7470d89
+	.quad 0x3fefffffffffffff, 0x3cd4cd9c04158cd7
+	.quad 0x3fefffffffffffff, 0x3cd2fa34bf5c8344
+	.quad 0x3fefffffffffffff, 0x3cd14f4890ff2461
+	.quad 0x3fefffffffffffff, 0x3ccf92c49dfa4df5
+	.quad 0x3fefffffffffffff, 0x3ccccaaea71ab0df
+	.quad 0x3fefffffffffffff, 0x3cca40829f001197
+	.quad 0x3ff0000000000000, 0x3cc7eef13b59e96c
+	.quad 0x3ff0000000000000, 0x3cc5d11e1a252bf5
+	.quad 0x3ff0000000000000, 0x3cc3e296303b2297
+	.quad 0x3ff0000000000000, 0x3cc21f47009f43ce
+	.quad 0x3ff0000000000000, 0x3cc083768c5e4542
+	.quad 0x3ff0000000000000, 0x3cbe1777d831265f
+	.quad 0x3ff0000000000000, 0x3cbb69f10b0191b5
+	.quad 0x3ff0000000000000, 0x3cb8f8a3a05b5b53
+	.quad 0x3ff0000000000000, 0x3cb6be573c40c8e7
+	.quad 0x3ff0000000000000, 0x3cb4b645ba991fdb
+	.align 32
+	.quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* _AbsMask */
+	.align 32
+	.quad 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000 /* _MaxThreshold = 6.0 - 1.0/128.0 */
+	.align 32
+	.quad 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000 /* _SRound */
+	.align 32
+	.quad 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000 /* _U2Threshold */
+	.align 32
+	.quad 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5 /* _poly_1_0 */
+	.align 32
+	.quad 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1 /* _poly_1_1 */
+	.align 32
+	.quad 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57 /* _poly_3_0 */
+	.align 32
+	.quad 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8 /* _poly_3_1 */
+	.align 32
+	.quad 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F /* _poly_5_0 */
+	.align 32
+	.quad
0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122 /* _poly_5_1 */
+	.align 32
+	.quad 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6 /* _poly_1_2 */
+	.align 32
+	.quad 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd /* _poly_3_2 */
+	.align 32
+	.quad 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c /* _poly_1_3 */
+	.align 32
+	.quad 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555 /* _poly_3_3 */
+	.align 32
+	.quad 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff /* _Mask32 */
+	.align 32
+	.type	__svml_derf_data_internal,@object
+	.size	__svml_derf_data_internal,.-__svml_derf_data_internal
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core-avx2.S
new file mode 100644
index 0000000000..3456142289
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core-avx2.S
@@ -0,0 +1,20 @@
+/* AVX2 version of vectorized erf, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.
 */
+
+#define _ZGVeN8v_erf _ZGVeN8v_erf_avx2_wrapper
+#include "../svml_d_erf8_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core.c
new file mode 100644
index 0000000000..78e4a852c6
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized erf, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVeN8v_erf
+#include "ifunc-mathvec-avx512-skx.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVeN8v_erf, __GI__ZGVeN8v_erf, __redirect__ZGVeN8v_erf)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core_avx512.S
new file mode 100644
index 0000000000..38f373102a
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_erf8_core_avx512.S
@@ -0,0 +1,983 @@
+/* Function erf vectorized with AVX-512.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   https://www.gnu.org/licenses/.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *   Basic formula is
+ *   erf(x) ~ erf(x0) +
+ *    + exp(-x0*x0)*D*(1+c0+T*P1(T)+D^2*P3(T)+D^4*P5(T)+D^6*p7+D^8*p9)
+ *   where D=x-x0, T=x0*D
+ *   x0 is x rounded to a specified number of fractional bits (in this case 7),
+ *   except that x0=0 for |x|<3.5/128.0 (using x0=0 for first 4 table entries)
+ *
+ *   Data table packs both erf(x0)_high and a few bits of erf(x0)_low in one
+ *   entry (in place of redundant exponent bits)
+ *
+ */
+
+/* Offsets for data table __svml_derf_data_internal
+ */
+#define _erf_tbl	0
+#define _AbsMask	12288
+#define _MaxThreshold	12352
+#define _SRound		12416
+#define _U2Threshold	12480
+#define _poly1_0	12544
+#define _poly1_1	12608
+#define _poly3_0	12672
+#define _poly3_1	12736
+#define _poly5_0	12800
+#define _poly5_1	12864
+#define _poly1_2	12928
+#define _poly3_2	12992
+#define _poly1_3	13056
+#define _poly3_3	13120
+#define _Mask32		13184
+
+#include <sysdep.h>
+
+	.text
+	.section .text.evex512,"ax",@progbits
+ENTRY(_ZGVeN8v_erf_skx)
+/*
+ * vector gather: erf(x0),
+ * second value is exp(-x0*x0)
+ */
+	lea	__svml_derf_data_internal(%rip), %rax
+
+/*
+ * erf(x) rounds to 1.0 for x>_MaxThreshold (5.9921875)
+ * can compute all results in the main path
+ */
+	vmovups	_MaxThreshold+__svml_derf_data_internal(%rip), %zmm9
+	vmovups
_SRound+__svml_derf_data_internal(%rip), %zmm11
+	vmovups	_U2Threshold+__svml_derf_data_internal(%rip), %zmm10
+	vandpd	_AbsMask+__svml_derf_data_internal(%rip), %zmm0, %zmm7
+	vpternlogd $0xff, %zmm1, %zmm1, %zmm14
+	kxnorw	%k0, %k0, %k3
+	kxnorw	%k0, %k0, %k2
+	vminpd	{sae}, %zmm9, %zmm7, %zmm12
+
+/* save sign */
+	vxorpd	%zmm0, %zmm7, %zmm8
+	vaddpd	{rn-sae}, %zmm11, %zmm12, %zmm15
+	vcmppd	$26, {sae}, %zmm10, %zmm12, %k1
+
+/*
+ * _LA_ polynomial computation
+ * Start polynomial evaluation
+ */
+	vmovups	_poly1_0+__svml_derf_data_internal(%rip), %zmm10
+	vpsllq	$4, %zmm15, %zmm3
+	vsubpd	{rn-sae}, %zmm11, %zmm15, %zmm13
+	vmovups	_poly3_0+__svml_derf_data_internal(%rip), %zmm11
+	vmovups	_poly3_3+__svml_derf_data_internal(%rip), %zmm15
+	vsubpd	{rn-sae}, %zmm13, %zmm12, %zmm1
+	vmulpd	{rn-sae}, %zmm1, %zmm13, %zmm6
+
+/* NaN fixup */
+	vminpd	{sae}, %zmm7, %zmm1, %zmm7
+	vmovups	_poly1_2+__svml_derf_data_internal(%rip), %zmm13
+	vpandq	_Mask32+__svml_derf_data_internal(%rip), %zmm3, %zmm2
+	vpmovqd	%zmm2, %ymm0
+	vmovups	_poly1_1+__svml_derf_data_internal(%rip), %zmm2
+	vfmadd231pd {rn-sae}, %zmm6, %zmm10, %zmm2
+	vfmadd213pd {rn-sae}, %zmm13, %zmm6, %zmm2
+	vpxord	%zmm4, %zmm4, %zmm4
+	vgatherdpd 8(%rax,%ymm0), %zmm4{%k3}
+	vpxord	%zmm5, %zmm5, %zmm5
+	vgatherdpd (%rax,%ymm0), %zmm5{%k2}
+	vmovups	_poly3_1+__svml_derf_data_internal(%rip), %zmm0
+
+/* Sign | _Erf_H */
+	vxorpd	%zmm8, %zmm5, %zmm5
+	vfmadd231pd {rn-sae}, %zmm6, %zmm11, %zmm0
+	vpandnq	%zmm12, %zmm12, %zmm14{%k1}
+	vandpd	%zmm14, %zmm1, %zmm9
+
+/* Sign | Diff */
+	vxorpd	%zmm8, %zmm7, %zmm1
+	vmovups	_poly5_0+__svml_derf_data_internal(%rip), %zmm12
+	vmovups	_poly5_1+__svml_derf_data_internal(%rip), %zmm7
+	vmovups	_poly3_2+__svml_derf_data_internal(%rip), %zmm14
+
+/* D2 = Diff^2 */
+	vmulpd	{rn-sae}, %zmm9, %zmm9, %zmm3
+
+/* T^2 */
+	vmulpd	{rn-sae}, %zmm6, %zmm6, %zmm9
+
+/* exp_h(x0) * Diff */
+	vmulpd	{rn-sae}, %zmm1, %zmm4, %zmm4
+	vfmadd231pd {rn-sae}, %zmm6, %zmm12, %zmm7
+	vmovups
_poly1_3+__svml_derf_data_internal(%rip), %zmm12
+	vfmadd213pd {rn-sae}, %zmm14, %zmm6, %zmm0
+	vfmadd213pd {rn-sae}, %zmm15, %zmm3, %zmm7
+	vfmadd213pd {rn-sae}, %zmm12, %zmm6, %zmm2
+	vfmadd213pd {rn-sae}, %zmm7, %zmm6, %zmm0
+
+/* P1 = T^2*P1 - T */
+	vfmsub213pd {rn-sae}, %zmm6, %zmm9, %zmm2
+
+/* P1 + P3*D2 */
+	vfmadd213pd {rn-sae}, %zmm2, %zmm3, %zmm0
+
+/*
+ * branch-free
+ * low part of result: exp_h(x0) * Diff*(1+P1)
+ */
+	vfmadd213pd {rn-sae}, %zmm4, %zmm4, %zmm0
+
+/* Final result */
+	vaddpd	{rn-sae}, %zmm5, %zmm0, %zmm6
+
+/* Fix erf(-0) = -0 */
+	vorpd	%zmm8, %zmm6, %zmm0
+	ret
+
+END(_ZGVeN8v_erf_skx)
+
+	.section .rodata, "a"
+	.align 64
+
+#ifdef __svml_derf_data_internal_typedef
+typedef unsigned int VUINT32;
+typedef struct
+{
+	__declspec(align(64)) VUINT32 _erf_tbl[6*128*2][2];
+	__declspec(align(64)) VUINT32 _AbsMask[8][2];
+	__declspec(align(64)) VUINT32 _MaxThreshold[8][2];
+	__declspec(align(64)) VUINT32 _SRound[8][2];
+	__declspec(align(64)) VUINT32 _U2Threshold[8][2];
+	__declspec(align(64)) VUINT32 _poly1_0[8][2];
+	__declspec(align(64)) VUINT32 _poly1_1[8][2];
+	__declspec(align(64)) VUINT32 _poly3_0[8][2];
+	__declspec(align(64)) VUINT32 _poly3_1[8][2];
+	__declspec(align(64)) VUINT32 _poly5_0[8][2];
+	__declspec(align(64)) VUINT32 _poly5_1[8][2];
+	__declspec(align(64)) VUINT32 _poly1_2[8][2];
+	__declspec(align(64)) VUINT32 _poly3_2[8][2];
+	__declspec(align(64)) VUINT32 _poly1_3[8][2];
+	__declspec(align(64)) VUINT32 _poly3_3[8][2];
+	__declspec(align(64)) VUINT32 _Mask32[8][2];
+} __svml_derf_data_internal;
+#endif
+__svml_derf_data_internal:
+	/*== _erf_tbl ==*/
+	.quad 0x0000000000000000, 0x3ff20dd750429b6d
+	.quad 0x3f820dbf3deb1340, 0x3ff20d8f1975c85d
+	.quad 0x3f920d77083f17a0, 0x3ff20cb67bd452c7
+	.quad 0x3f9b137e0cf584dc, 0x3ff20b4d8bac36c1
+	.quad 0x3fa20c5645dd2538, 0x3ff209546ad13ccf
+	.quad 0x3fa68e5d3bbc9526, 0x3ff206cb4897b148
+	.quad 0x3fab0fafef135745, 0x3ff203b261cd0053
+	.quad 0x3faf902a77bd3821,
0x3ff2000a00ae3804 + .quad 0x3fb207d480e90658, 0x3ff1fbd27cdc72d3 + .quad 0x3fb44703e87e8593, 0x3ff1f70c3b4f2cc8 + .quad 0x3fb68591a1e83b5d, 0x3ff1f1b7ae44867f + .quad 0x3fb8c36beb8a8d23, 0x3ff1ebd5552f795b + .quad 0x3fbb0081148a873a, 0x3ff1e565bca400d4 + .quad 0x3fbd3cbf7e70a4b3, 0x3ff1de697e413d29 + .quad 0x3fbf78159ec8bb50, 0x3ff1d6e14099944a + .quad 0x3fc0d939005f65e5, 0x3ff1cecdb718d61c + .quad 0x3fc1f5e1a35c3b89, 0x3ff1c62fa1e869b6 + .quad 0x3fc311fc15f56d14, 0x3ff1bd07cdd189ac + .quad 0x3fc42d7fc2f64959, 0x3ff1b357141d95d5 + .quad 0x3fc548642321d7c6, 0x3ff1a91e5a748165 + .quad 0x3fc662a0bdf7a89f, 0x3ff19e5e92b964ab + .quad 0x3fc77c2d2a765f9e, 0x3ff19318bae53a04 + .quad 0x3fc895010fdbdbfd, 0x3ff1874ddcdfce24 + .quad 0x3fc9ad142662e14d, 0x3ff17aff0e56ec10 + .quad 0x3fcac45e37fe2526, 0x3ff16e2d7093cd8c + .quad 0x3fcbdad72110a648, 0x3ff160da304ed92f + .quad 0x3fccf076d1233237, 0x3ff153068581b781 + .quad 0x3fce05354b96ff36, 0x3ff144b3b337c90c + .quad 0x3fcf190aa85540e2, 0x3ff135e3075d076b + .quad 0x3fd015f78a3dcf3d, 0x3ff12695da8b5bde + .quad 0x3fd09eed6982b948, 0x3ff116cd8fd67618 + .quad 0x3fd127631eb8de32, 0x3ff1068b94962e5e + .quad 0x3fd1af54e232d609, 0x3ff0f5d1602f7e41 + .quad 0x3fd236bef825d9a2, 0x3ff0e4a073dc1b91 + .quad 0x3fd2bd9db0f7827f, 0x3ff0d2fa5a70c168 + .quad 0x3fd343ed6989b7d9, 0x3ff0c0e0a8223359 + .quad 0x3fd3c9aa8b84beda, 0x3ff0ae54fa490723 + .quad 0x3fd44ed18d9f6462, 0x3ff09b58f724416b + .quad 0x3fd4d35ef3e5372e, 0x3ff087ee4d9ad247 + .quad 0x3fd5574f4ffac98e, 0x3ff07416b4fbfe7c + .quad 0x3fd5da9f415ff23f, 0x3ff05fd3ecbec298 + .quad 0x3fd65d4b75b00471, 0x3ff04b27bc403d30 + .quad 0x3fd6df50a8dff772, 0x3ff03613f2812daf + .quad 0x3fd760aba57a76bf, 0x3ff0209a65e29545 + .quad 0x3fd7e15944d9d3e4, 0x3ff00abcf3e187a9 + .quad 0x3fd861566f5fd3c0, 0x3fefe8fb01a47307 + .quad 0x3fd8e0a01cab516b, 0x3fefbbbbef34b4b2 + .quad 0x3fd95f3353cbb146, 0x3fef8dc092d58ff8 + .quad 0x3fd9dd0d2b721f39, 0x3fef5f0cdaf15313 + .quad 0x3fda5a2aca209394, 0x3fef2fa4c16c0019 + 
.quad 0x3fdad68966569a87, 0x3feeff8c4b1375db + .quad 0x3fdb522646bbda68, 0x3feecec7870ebca8 + .quad 0x3fdbccfec24855b8, 0x3fee9d5a8e4c934e + .quad 0x3fdc4710406a65fc, 0x3fee6b4982f158b9 + .quad 0x3fdcc058392a6d2d, 0x3fee38988fc46e72 + .quad 0x3fdd38d4354c3bd0, 0x3fee054be79d3042 + .quad 0x3fddb081ce6e2a48, 0x3fedd167c4cf9d2a + .quad 0x3fde275eaf25e458, 0x3fed9cf06898cdaf + .quad 0x3fde9d68931ae650, 0x3fed67ea1a8b5368 + .quad 0x3fdf129d471eabb1, 0x3fed325927fb9d89 + .quad 0x3fdf86faa9428f9d, 0x3fecfc41e36c7df9 + .quad 0x3fdffa7ea8eb5fd0, 0x3fecc5a8a3fbea40 + .quad 0x3fe03693a371519c, 0x3fec8e91c4d01368 + .quad 0x3fe06f794ab2cae7, 0x3fec5701a484ef9d + .quad 0x3fe0a7ef5c18edd2, 0x3fec1efca49a5011 + .quad 0x3fe0dff4f247f6c6, 0x3febe68728e29d5e + .quad 0x3fe1178930ada115, 0x3febada596f25436 + .quad 0x3fe14eab43841b55, 0x3feb745c55905bf8 + .quad 0x3fe1855a5fd3dd50, 0x3feb3aafcc27502e + .quad 0x3fe1bb95c3746199, 0x3feb00a46237d5be + .quad 0x3fe1f15cb50bc4de, 0x3feac63e7ecc1411 + .quad 0x3fe226ae840d4d70, 0x3fea8b8287ec6a09 + .quad 0x3fe25b8a88b6dd7f, 0x3fea5074e2157620 + .quad 0x3fe28ff0240d52cd, 0x3fea1519efaf889e + .quad 0x3fe2c3debfd7d6c1, 0x3fe9d97610879642 + .quad 0x3fe2f755ce9a21f4, 0x3fe99d8da149c13f + .quad 0x3fe32a54cb8db67b, 0x3fe96164fafd8de3 + .quad 0x3fe35cdb3a9a144d, 0x3fe925007283d7aa + .quad 0x3fe38ee8a84beb71, 0x3fe8e86458169af8 + .quad 0x3fe3c07ca9cb4f9e, 0x3fe8ab94f6caa71d + .quad 0x3fe3f196dcd0f135, 0x3fe86e9694134b9e + .quad 0x3fe42236e79a5fa6, 0x3fe8316d6f48133d + .quad 0x3fe4525c78dd5966, 0x3fe7f41dc12c9e89 + .quad 0x3fe4820747ba2dc2, 0x3fe7b6abbb7aaf19 + .quad 0x3fe4b13713ad3513, 0x3fe7791b886e7403 + .quad 0x3fe4dfeba47f63cc, 0x3fe73b714a552763 + .quad 0x3fe50e24ca35fd2c, 0x3fe6fdb11b1e0c34 + .quad 0x3fe53be25d016a4f, 0x3fe6bfdf0beddaf5 + .quad 0x3fe569243d2b3a9b, 0x3fe681ff24b4ab04 + .quad 0x3fe595ea53035283, 0x3fe6441563c665d4 + .quad 0x3fe5c2348ecc4dc3, 0x3fe60625bd75d07b + .quad 0x3fe5ee02e8a71a53, 0x3fe5c8341bb23767 + .quad 0x3fe61955607dd15d, 
0x3fe58a445da7c74c + .quad 0x3fe6442bfdedd397, 0x3fe54c5a57629db0 + .quad 0x3fe66e86d0312e82, 0x3fe50e79d1749ac9 + .quad 0x3fe69865ee075011, 0x3fe4d0a6889dfd9f + .quad 0x3fe6c1c9759d0e5f, 0x3fe492e42d78d2c5 + .quad 0x3fe6eab18c74091b, 0x3fe4553664273d24 + .quad 0x3fe7131e5f496a5a, 0x3fe417a0c4049fd0 + .quad 0x3fe73b1021fc0cb8, 0x3fe3da26d759aef5 + .quad 0x3fe762870f720c6f, 0x3fe39ccc1b136d5a + .quad 0x3fe78983697dc96f, 0x3fe35f93fe7d1b3d + .quad 0x3fe7b00578c26037, 0x3fe32281e2fd1a92 + .quad 0x3fe7d60d8c979f7b, 0x3fe2e5991bd4cbfc + .quad 0x3fe7fb9bfaed8078, 0x3fe2a8dcede3673b + .quad 0x3fe820b1202f27fb, 0x3fe26c508f6bd0ff + .quad 0x3fe8454d5f25760d, 0x3fe22ff727dd6f7b + .quad 0x3fe8697120d92a4a, 0x3fe1f3d3cf9ffe5a + .quad 0x3fe88d1cd474a2e0, 0x3fe1b7e98fe26217 + .quad 0x3fe8b050ef253c37, 0x3fe17c3b626c7a12 + .quad 0x3fe8d30debfc572e, 0x3fe140cc3173f007 + .quad 0x3fe8f5544bd00c04, 0x3fe1059ed7740313 + .quad 0x3fe91724951b8fc6, 0x3fe0cab61f084b93 + .quad 0x3fe9387f53df5238, 0x3fe09014c2ca74da + .quad 0x3fe959651980da31, 0x3fe055bd6d32e8d7 + .quad 0x3fe979d67caa6631, 0x3fe01bb2b87c6968 + .quad 0x3fe999d4192a5715, 0x3fdfc3ee5d1524b0 + .quad 0x3fe9b95e8fd26aba, 0x3fdf511a91a67d2a + .quad 0x3fe9d8768656cc42, 0x3fdedeeee0959518 + .quad 0x3fe9f71ca72cffb6, 0x3fde6d6ffaa65a25 + .quad 0x3fea1551a16aaeaf, 0x3fddfca26f5bbf88 + .quad 0x3fea331628a45b92, 0x3fdd8c8aace11e63 + .quad 0x3fea506af4cc00f4, 0x3fdd1d2cfff91594 + .quad 0x3fea6d50c20fa293, 0x3fdcae8d93f1d7b7 + .quad 0x3fea89c850b7d54d, 0x3fdc40b0729ed548 + .quad 0x3feaa5d265064366, 0x3fdbd3998457afdb + .quad 0x3feac16fc7143263, 0x3fdb674c8ffc6283 + .quad 0x3feadca142b10f98, 0x3fdafbcd3afe8ab6 + .quad 0x3feaf767a741088b, 0x3fda911f096fbc26 + .quad 0x3feb11c3c79bb424, 0x3fda27455e14c93c + .quad 0x3feb2bb679ead19c, 0x3fd9be437a7de946 + .quad 0x3feb4540978921ee, 0x3fd9561c7f23a47b + .quad 0x3feb5e62fce16095, 0x3fd8eed36b886d93 + .quad 0x3feb771e894d602e, 0x3fd8886b1e5ecfd1 + .quad 0x3feb8f741ef54f83, 0x3fd822e655b417e7 + 
.quad 0x3feba764a2af2b78, 0x3fd7be47af1f5d89 + .quad 0x3febbef0fbde6221, 0x3fd75a91a7f4d2ed + .quad 0x3febd61a1453ab44, 0x3fd6f7c69d7d3ef8 + .quad 0x3febece0d82d1a5c, 0x3fd695e8cd31867e + .quad 0x3fec034635b66e23, 0x3fd634fa54fa285f + .quad 0x3fec194b1d49a184, 0x3fd5d4fd33729015 + .quad 0x3fec2ef0812fc1bd, 0x3fd575f3483021c3 + .quad 0x3fec443755820d64, 0x3fd517de540ce2a3 + .quad 0x3fec5920900b5fd1, 0x3fd4babff975a04c + .quad 0x3fec6dad2829ec62, 0x3fd45e99bcbb7915 + .quad 0x3fec81de16b14cef, 0x3fd4036d0468a7a2 + .quad 0x3fec95b455cce69d, 0x3fd3a93b1998736c + .quad 0x3feca930e0e2a825, 0x3fd35005285227f1 + .quad 0x3fecbc54b476248d, 0x3fd2f7cc3fe6f423 + .quad 0x3feccf20ce0c0d27, 0x3fd2a09153529381 + .quad 0x3fece1962c0e0d8b, 0x3fd24a55399ea239 + .quad 0x3fecf3b5cdaf0c39, 0x3fd1f518ae487dc8 + .quad 0x3fed0580b2cfd249, 0x3fd1a0dc51a9934d + .quad 0x3fed16f7dbe41ca0, 0x3fd14da0a961fd14 + .quad 0x3fed281c49d818d0, 0x3fd0fb6620c550af + .quad 0x3fed38eefdf64fdd, 0x3fd0aa2d09497f2b + .quad 0x3fed4970f9ce00d9, 0x3fd059f59af7a906 + .quad 0x3fed59a33f19ed42, 0x3fd00abff4dec7a3 + .quad 0x3fed6986cfa798e7, 0x3fcf79183b101c5b + .quad 0x3fed791cad3eff01, 0x3fcedeb406d9c825 + .quad 0x3fed8865d98abe01, 0x3fce4652fadcb6b2 + .quad 0x3fed97635600bb89, 0x3fcdaff4969c0b04 + .quad 0x3feda61623cb41e0, 0x3fcd1b982c501370 + .quad 0x3fedb47f43b2980d, 0x3fcc893ce1dcbef7 + .quad 0x3fedc29fb60715af, 0x3fcbf8e1b1ca2279 + .quad 0x3fedd0787a8bb39d, 0x3fcb6a856c3ed54f + .quad 0x3fedde0a90611a0d, 0x3fcade26b7fbed95 + .quad 0x3fedeb56f5f12d28, 0x3fca53c4135a6526 + .quad 0x3fedf85ea8db188e, 0x3fc9cb5bd549b111 + .quad 0x3fee0522a5dfda73, 0x3fc944ec2e4f5630 + .quad 0x3fee11a3e8cf4eb8, 0x3fc8c07329874652 + .quad 0x3fee1de36c75ba58, 0x3fc83deeada4d25a + .quad 0x3fee29e22a89d766, 0x3fc7bd5c7df3fe9c + .quad 0x3fee35a11b9b61ce, 0x3fc73eba3b5b07b7 + .quad 0x3fee4121370224cc, 0x3fc6c205655be720 + .quad 0x3fee4c6372cd8927, 0x3fc6473b5b15a7a1 + .quad 0x3fee5768c3b4a3fc, 0x3fc5ce595c455b0a + .quad 0x3fee62321d06c5e0, 
0x3fc5575c8a468362 + .quad 0x3fee6cc0709c8a0d, 0x3fc4e241e912c305 + .quad 0x3fee7714aec96534, 0x3fc46f066040a832 + .quad 0x3fee812fc64db369, 0x3fc3fda6bc016994 + .quad 0x3fee8b12a44944a8, 0x3fc38e1fae1d6a9d + .quad 0x3fee94be342e6743, 0x3fc3206dceef5f87 + .quad 0x3fee9e335fb56f87, 0x3fc2b48d9e5dea1c + .quad 0x3feea7730ed0bbb9, 0x3fc24a7b84d38971 + .quad 0x3feeb07e27a133aa, 0x3fc1e233d434b813 + .quad 0x3feeb9558e6b42ce, 0x3fc17bb2c8d41535 + .quad 0x3feec1fa258c4bea, 0x3fc116f48a6476cc + .quad 0x3feeca6ccd709544, 0x3fc0b3f52ce8c383 + .quad 0x3feed2ae6489ac1e, 0x3fc052b0b1a174ea + .quad 0x3feedabfc7453e63, 0x3fbfe6460fef4680 + .quad 0x3feee2a1d004692c, 0x3fbf2a901ccafb37 + .quad 0x3feeea5557137ae0, 0x3fbe723726b824a9 + .quad 0x3feef1db32a2277c, 0x3fbdbd32ac4c99b0 + .quad 0x3feef93436bc2daa, 0x3fbd0b7a0f921e7c + .quad 0x3fef006135426b26, 0x3fbc5d0497c09e74 + .quad 0x3fef0762fde45ee6, 0x3fbbb1c972f23e50 + .quad 0x3fef0e3a5e1a1788, 0x3fbb09bfb7d11a84 + .quad 0x3fef14e8211e8c55, 0x3fba64de673e8837 + .quad 0x3fef1b6d0fea5f4d, 0x3fb9c31c6df3b1b8 + .quad 0x3fef21c9f12f0677, 0x3fb92470a61b6965 + .quad 0x3fef27ff89525acf, 0x3fb888d1d8e510a3 + .quad 0x3fef2e0e9a6a8b09, 0x3fb7f036c0107294 + .quad 0x3fef33f7e43a706b, 0x3fb75a96077274ba + .quad 0x3fef39bc242e43e6, 0x3fb6c7e64e7281cb + .quad 0x3fef3f5c1558b19e, 0x3fb6381e2980956b + .quad 0x3fef44d870704911, 0x3fb5ab342383d178 + .quad 0x3fef4a31ebcd47df, 0x3fb5211ebf41880b + .quad 0x3fef4f693b67bd77, 0x3fb499d478bca735 + .quad 0x3fef547f10d60597, 0x3fb4154bc68d75c3 + .quad 0x3fef59741b4b97cf, 0x3fb3937b1b31925a + .quad 0x3fef5e4907982a07, 0x3fb31458e6542847 + .quad 0x3fef62fe80272419, 0x3fb297db960e4f63 + .quad 0x3fef67952cff6282, 0x3fb21df9981f8e53 + .quad 0x3fef6c0db3c34641, 0x3fb1a6a95b1e786f + .quad 0x3fef7068b7b10fd9, 0x3fb131e14fa1625d + .quad 0x3fef74a6d9a38383, 0x3fb0bf97e95f2a64 + .quad 0x3fef78c8b812d498, 0x3fb04fc3a0481321 + .quad 0x3fef7cceef15d631, 0x3fafc4b5e32d6259 + .quad 0x3fef80ba18636f07, 0x3faeeea8c1b1db94 + 
.quad 0x3fef848acb544e95, 0x3fae1d4cf1e2450a + .quad 0x3fef88419ce4e184, 0x3fad508f9a1ea64f + .quad 0x3fef8bdf1fb78370, 0x3fac885df3451a07 + .quad 0x3fef8f63e416ebff, 0x3fabc4a54a84e834 + .quad 0x3fef92d077f8d56d, 0x3fab055303221015 + .quad 0x3fef96256700da8e, 0x3faa4a549829587e + .quad 0x3fef99633a838a57, 0x3fa993979e14fffe + .quad 0x3fef9c8a7989af0d, 0x3fa8e109c4622913 + .quad 0x3fef9f9ba8d3c733, 0x3fa83298d717210e + .quad 0x3fefa2974addae45, 0x3fa78832c03aa2b1 + .quad 0x3fefa57ddfe27376, 0x3fa6e1c5893c380b + .quad 0x3fefa84fe5e05c8d, 0x3fa63f3f5c4de13b + .quad 0x3fefab0dd89d1309, 0x3fa5a08e85af27e0 + .quad 0x3fefadb831a9f9c3, 0x3fa505a174e9c929 + .quad 0x3fefb04f6868a944, 0x3fa46e66be002240 + .quad 0x3fefb2d3f20f9101, 0x3fa3dacd1a8d8cce + .quad 0x3fefb54641aebbc9, 0x3fa34ac36ad8dafe + .quad 0x3fefb7a6c834b5a2, 0x3fa2be38b6d92415 + .quad 0x3fefb9f5f4739170, 0x3fa2351c2f2d1449 + .quad 0x3fefbc3433260ca5, 0x3fa1af5d2e04f3f6 + .quad 0x3fefbe61eef4cf6a, 0x3fa12ceb37ff9bc3 + .quad 0x3fefc07f907bc794, 0x3fa0adb5fcfa8c75 + .quad 0x3fefc28d7e4f9cd0, 0x3fa031ad58d56279 + .quad 0x3fefc48c1d033c7a, 0x3f9f7182a851bca2 + .quad 0x3fefc67bcf2d7b8f, 0x3f9e85c449e377f3 + .quad 0x3fefc85cf56ecd38, 0x3f9da0005e5f28df + .quad 0x3fefca2fee770c79, 0x3f9cc0180af00a8b + .quad 0x3fefcbf5170b578b, 0x3f9be5ecd2fcb5f9 + .quad 0x3fefcdacca0bfb73, 0x3f9b1160991ff737 + .quad 0x3fefcf57607a6e7c, 0x3f9a4255a00b9f03 + .quad 0x3fefd0f5317f582f, 0x3f9978ae8b55ce1b + .quad 0x3fefd2869270a56f, 0x3f98b44e6031383e + .quad 0x3fefd40bd6d7a785, 0x3f97f5188610ddc8 + .quad 0x3fefd58550773cb5, 0x3f973af0c737bb45 + .quad 0x3fefd6f34f52013a, 0x3f9685bb5134ef13 + .quad 0x3fefd85621b0876d, 0x3f95d55cb54cd53a + .quad 0x3fefd9ae142795e3, 0x3f9529b9e8cf9a1e + .quad 0x3fefdafb719e6a69, 0x3f9482b8455dc491 + .quad 0x3fefdc3e835500b3, 0x3f93e03d891b37de + .quad 0x3fefdd7790ea5bc0, 0x3f93422fd6d12e2b + .quad 0x3fefdea6e062d0c9, 0x3f92a875b5ffab56 + .quad 0x3fefdfccb62e52d3, 0x3f9212f612dee7fb + .quad 0x3fefe0e9552ebdd6, 
0x3f9181983e5133dd + .quad 0x3fefe1fcfebe2083, 0x3f90f443edc5ce49 + .quad 0x3fefe307f2b503d0, 0x3f906ae13b0d3255 + .quad 0x3fefe40a6f70af4b, 0x3f8fcab1483ea7fc + .quad 0x3fefe504b1d9696c, 0x3f8ec72615a894c4 + .quad 0x3fefe5f6f568b301, 0x3f8dcaf3691fc448 + .quad 0x3fefe6e1742f7cf6, 0x3f8cd5ec93c12432 + .quad 0x3fefe7c466dc57a1, 0x3f8be7e5ac24963b + .quad 0x3fefe8a004c19ae6, 0x3f8b00b38d6b3575 + .quad 0x3fefe97483db8670, 0x3f8a202bd6372dce + .quad 0x3fefea4218d6594a, 0x3f894624e78e0faf + .quad 0x3fefeb08f7146046, 0x3f887275e3a6869e + .quad 0x3fefebc950b3fa75, 0x3f87a4f6aca256cb + .quad 0x3fefec835695932e, 0x3f86dd7fe3358230 + .quad 0x3fefed37386190fb, 0x3f861beae53b72b7 + .quad 0x3fefede5248e38f4, 0x3f856011cc3b036d + .quad 0x3fefee8d486585ee, 0x3f84a9cf6bda3f4c + .quad 0x3fefef2fd00af31a, 0x3f83f8ff5042a88e + .quad 0x3fefefcce6813974, 0x3f834d7dbc76d7e5 + .quad 0x3feff064b5afffbe, 0x3f82a727a89a3f14 + .quad 0x3feff0f766697c76, 0x3f8205dac02bd6b9 + .quad 0x3feff18520700971, 0x3f81697560347b26 + .quad 0x3feff20e0a7ba8c2, 0x3f80d1d69569b82d + .quad 0x3feff2924a3f7a83, 0x3f803ede1a45bfee + .quad 0x3feff312046f2339, 0x3f7f60d8aa2a88f2 + .quad 0x3feff38d5cc4227f, 0x3f7e4cc4abf7d065 + .quad 0x3feff404760319b4, 0x3f7d4143a9dfe965 + .quad 0x3feff47772010262, 0x3f7c3e1a5f5c077c + .quad 0x3feff4e671a85425, 0x3f7b430ecf4a83a8 + .quad 0x3feff55194fe19df, 0x3f7a4fe83fb9db25 + .quad 0x3feff5b8fb26f5f6, 0x3f79646f35a76624 + .quad 0x3feff61cc26c1578, 0x3f78806d70b2fc36 + .quad 0x3feff67d08401202, 0x3f77a3ade6c8b3e5 + .quad 0x3feff6d9e943c231, 0x3f76cdfcbfc1e263 + .quad 0x3feff733814af88c, 0x3f75ff2750fe7820 + .quad 0x3feff789eb6130c9, 0x3f7536fc18f7ce5c + .quad 0x3feff7dd41ce2b4d, 0x3f74754abacdf1dc + .quad 0x3feff82d9e1a76d8, 0x3f73b9e3f9d06e3f + .quad 0x3feff87b1913e853, 0x3f730499b503957f + .quad 0x3feff8c5cad200a5, 0x3f72553ee2a336bf + .quad 0x3feff90dcaba4096, 0x3f71aba78ba3af89 + .quad 0x3feff9532f846ab0, 0x3f7107a8c7323a6e + .quad 0x3feff9960f3eb327, 0x3f706918b6355624 + 
.quad 0x3feff9d67f51ddba, 0x3f6f9f9cfd9c3035 + .quad 0x3feffa14948549a7, 0x3f6e77448fb66bb9 + .quad 0x3feffa506302ebae, 0x3f6d58da68fd1170 + .quad 0x3feffa89fe5b3625, 0x3f6c4412bf4b8f0b + .quad 0x3feffac17988ef4b, 0x3f6b38a3af2e55b4 + .quad 0x3feffaf6e6f4f5c0, 0x3f6a3645330550ff + .quad 0x3feffb2a5879f35e, 0x3f693cb11a30d765 + .quad 0x3feffb5bdf67fe6f, 0x3f684ba3004a50d0 + .quad 0x3feffb8b8c88295f, 0x3f6762d84469c18f + .quad 0x3feffbb970200110, 0x3f66821000795a03 + .quad 0x3feffbe599f4f9d9, 0x3f65a90b00981d93 + .quad 0x3feffc10194fcb64, 0x3f64d78bba8ca5fd + .quad 0x3feffc38fcffbb7c, 0x3f640d564548fad7 + .quad 0x3feffc60535dd7f5, 0x3f634a305080681f + .quad 0x3feffc862a501fd7, 0x3f628de11c5031eb + .quad 0x3feffcaa8f4c9bea, 0x3f61d83170fbf6fb + .quad 0x3feffccd8f5c66d1, 0x3f6128eb96be8798 + .quad 0x3feffcef371ea4d7, 0x3f607fdb4dafea5f + .quad 0x3feffd0f92cb6ba7, 0x3f5fb99b8b8279e1 + .quad 0x3feffd2eae369a07, 0x3f5e7f232d9e2630 + .quad 0x3feffd4c94d29fdb, 0x3f5d4fed7195d7e8 + .quad 0x3feffd6951b33686, 0x3f5c2b9cf7f893bf + .quad 0x3feffd84ef9009ee, 0x3f5b11d702b3deb2 + .quad 0x3feffd9f78c7524a, 0x3f5a024365f771bd + .quad 0x3feffdb8f7605ee7, 0x3f58fc8c794b03b5 + .quad 0x3feffdd1750e1220, 0x3f58005f08d6f1ef + .quad 0x3feffde8fb314ebf, 0x3f570d6a46e07dda + .quad 0x3feffdff92db56e5, 0x3f56235fbd7a4345 + .quad 0x3feffe1544d01ccb, 0x3f5541f340697987 + .quad 0x3feffe2a1988857c, 0x3f5468dadf4080ab + .quad 0x3feffe3e19349dc7, 0x3f5397ced7af2b15 + .quad 0x3feffe514bbdc197, 0x3f52ce898809244e + .quad 0x3feffe63b8c8b5f7, 0x3f520cc76202c5fb + .quad 0x3feffe7567b7b5e1, 0x3f515246dda49d47 + .quad 0x3feffe865fac722b, 0x3f509ec86c75d497 + .quad 0x3feffe96a78a04a9, 0x3f4fe41cd9bb4eee + .quad 0x3feffea645f6d6da, 0x3f4e97ba3b77f306 + .quad 0x3feffeb5415e7c44, 0x3f4d57f524723822 + .quad 0x3feffec39ff380b9, 0x3f4c245d4b99847a + .quad 0x3feffed167b12ac2, 0x3f4afc85e0f82e12 + .quad 0x3feffede9e5d3262, 0x3f49e005769dbc1d + .quad 0x3feffeeb49896c6d, 0x3f48ce75e9f6f8a0 + .quad 0x3feffef76e956a9f, 
0x3f47c7744d9378f7 + .quad 0x3fefff0312b010b5, 0x3f46caa0d3582fe9 + .quad 0x3fefff0e3ad91ec2, 0x3f45d79eb71e893b + .quad 0x3fefff18ebe2b0e1, 0x3f44ee1429bf7cc0 + .quad 0x3fefff232a72b48e, 0x3f440daa3c89f5b6 + .quad 0x3fefff2cfb0453d9, 0x3f43360ccd23db3a + .quad 0x3fefff3661e9569d, 0x3f4266ea71d4f71a + .quad 0x3fefff3f634b79f9, 0x3f419ff4663ae9df + .quad 0x3fefff48032dbe40, 0x3f40e0de78654d1e + .quad 0x3fefff50456dab8c, 0x3f40295ef6591848 + .quad 0x3fefff582dc48d30, 0x3f3ef25d37f49fe1 + .quad 0x3fefff5fbfc8a439, 0x3f3da01102b5f851 + .quad 0x3fefff66feee5129, 0x3f3c5b5412dcafad + .quad 0x3fefff6dee89352e, 0x3f3b23a5a23e4210 + .quad 0x3fefff7491cd4af6, 0x3f39f8893d8fd1c1 + .quad 0x3fefff7aebcff755, 0x3f38d986a4187285 + .quad 0x3fefff80ff8911fd, 0x3f37c629a822bc9e + .quad 0x3fefff86cfd3e657, 0x3f36be02102b3520 + .quad 0x3fefff8c5f702ccf, 0x3f35c0a378c90bca + .quad 0x3fefff91b102fca8, 0x3f34cda5374ea275 + .quad 0x3fefff96c717b695, 0x3f33e4a23d1f4703 + .quad 0x3fefff9ba420e834, 0x3f330538fbb77ecd + .quad 0x3fefffa04a7928b1, 0x3f322f0b496539be + .quad 0x3fefffa4bc63ee9a, 0x3f3161be46ad3b50 + .quad 0x3fefffa8fc0e5f33, 0x3f309cfa445b00ff + .quad 0x3fefffad0b901755, 0x3f2fc0d55470cf51 + .quad 0x3fefffb0ecebee1b, 0x3f2e577bbcd49935 + .quad 0x3fefffb4a210b172, 0x3f2cfd4a5adec5c0 + .quad 0x3fefffb82cd9dcbf, 0x3f2bb1a9657ce465 + .quad 0x3fefffbb8f1049c6, 0x3f2a740684026555 + .quad 0x3fefffbeca6adbe9, 0x3f2943d4a1d1ed39 + .quad 0x3fefffc1e08f25f5, 0x3f28208bc334a6a5 + .quad 0x3fefffc4d3120aa1, 0x3f2709a8db59f25c + .quad 0x3fefffc7a37857d2, 0x3f25feada379d8b7 + .quad 0x3fefffca53375ce3, 0x3f24ff207314a102 + .quad 0x3fefffcce3b57bff, 0x3f240a8c1949f75e + .quad 0x3fefffcf564ab6b7, 0x3f23207fb7420eb9 + .quad 0x3fefffd1ac4135f9, 0x3f22408e9ba3327f + .quad 0x3fefffd3e6d5cd87, 0x3f216a501f0e42ca + .quad 0x3fefffd607387b07, 0x3f209d5f819c9e29 + .quad 0x3fefffd80e8ce0da, 0x3f1fb2b792b40a22 + .quad 0x3fefffd9fdeabcce, 0x3f1e3bcf436a1a95 + .quad 0x3fefffdbd65e5ad0, 0x3f1cd55277c18d05 + 
.quad 0x3fefffdd98e903b2, 0x3f1b7e94604479dc + .quad 0x3fefffdf46816833, 0x3f1a36eec00926dd + .quad 0x3fefffe0e0140857, 0x3f18fdc1b2dcf7b9 + .quad 0x3fefffe26683972a, 0x3f17d2737527c3f9 + .quad 0x3fefffe3daa95b18, 0x3f16b4702d7d5849 + .quad 0x3fefffe53d558ae9, 0x3f15a329b7d30748 + .quad 0x3fefffe68f4fa777, 0x3f149e17724f4d41 + .quad 0x3fefffe7d156d244, 0x3f13a4b60ba9aa4e + .quad 0x3fefffe904222101, 0x3f12b6875310f785 + .quad 0x3fefffea2860ee1e, 0x3f11d312098e9dba + .quad 0x3fefffeb3ebb267b, 0x3f10f9e1b4dd36df + .quad 0x3fefffec47d19457, 0x3f102a8673a94692 + .quad 0x3fefffed443e2787, 0x3f0ec929a665b449 + .quad 0x3fefffee34943b15, 0x3f0d4f4b4c8e09ed + .quad 0x3fefffef1960d85d, 0x3f0be6abbb10a5aa + .quad 0x3fefffeff32af7af, 0x3f0a8e8cc1fadef6 + .quad 0x3feffff0c273bea2, 0x3f094637d5bacfdb + .quad 0x3feffff187b6bc0e, 0x3f080cfdc72220cf + .quad 0x3feffff2436a21dc, 0x3f06e2367dc27f95 + .quad 0x3feffff2f5fefcaa, 0x3f05c540b4936fd2 + .quad 0x3feffff39fe16963, 0x3f04b581b8d170fc + .quad 0x3feffff44178c8d2, 0x3f03b2652b06c2b2 + .quad 0x3feffff4db27f146, 0x3f02bb5cc22e5db6 + .quad 0x3feffff56d4d5e5e, 0x3f01cfe010e2052d + .quad 0x3feffff5f8435efc, 0x3f00ef6c4c84a0fe + .quad 0x3feffff67c604180, 0x3f001984165a5f36 + .quad 0x3feffff6f9f67e55, 0x3efe9b5e8d00ce77 + .quad 0x3feffff77154e0d6, 0x3efd16f5716c6c1a + .quad 0x3feffff7e2c6aea2, 0x3efba4f035d60e03 + .quad 0x3feffff84e93cd75, 0x3efa447b7b03f045 + .quad 0x3feffff8b500e77c, 0x3ef8f4ccca7fc90d + .quad 0x3feffff9164f8e46, 0x3ef7b5223dac7336 + .quad 0x3feffff972be5c59, 0x3ef684c227fcacef + .quad 0x3feffff9ca891572, 0x3ef562fac4329b48 + .quad 0x3feffffa1de8c582, 0x3ef44f21e49054f2 + .quad 0x3feffffa6d13de73, 0x3ef34894a5e24657 + .quad 0x3feffffab83e54b8, 0x3ef24eb7254ccf83 + .quad 0x3feffffaff99bac4, 0x3ef160f438c70913 + .quad 0x3feffffb43555b5f, 0x3ef07ebd2a2d2844 + .quad 0x3feffffb839e52f3, 0x3eef4f12e9ab070a + .quad 0x3feffffbc09fa7cd, 0x3eedb5ad0b27805c + .quad 0x3feffffbfa82616b, 0x3eec304efa2c6f4e + .quad 0x3feffffc316d9ed0, 
0x3eeabe09e9144b5e + .quad 0x3feffffc6586abf6, 0x3ee95df988e76644 + .quad 0x3feffffc96f1165e, 0x3ee80f439b4ee04b + .quad 0x3feffffcc5cec0c1, 0x3ee6d11788a69c64 + .quad 0x3feffffcf23ff5fc, 0x3ee5a2adfa0b4bc4 + .quad 0x3feffffd1c637b2b, 0x3ee4834877429b8f + .quad 0x3feffffd4456a10d, 0x3ee37231085c7d9a + .quad 0x3feffffd6a3554a1, 0x3ee26eb9daed6f7e + .quad 0x3feffffd8e1a2f22, 0x3ee1783ceac28910 + .quad 0x3feffffdb01e8546, 0x3ee08e1badf0fced + .quad 0x3feffffdd05a75ea, 0x3edf5f7d88472604 + .quad 0x3feffffdeee4f810, 0x3eddb92b5212fb8d + .quad 0x3feffffe0bd3e852, 0x3edc282cd3957eda + .quad 0x3feffffe273c15b7, 0x3edaab7abace48dc + .quad 0x3feffffe41314e06, 0x3ed94219bfcb4928 + .quad 0x3feffffe59c6698b, 0x3ed7eb1a2075864e + .quad 0x3feffffe710d565e, 0x3ed6a597219a93da + .quad 0x3feffffe8717232d, 0x3ed570b69502f313 + .quad 0x3feffffe9bf4098c, 0x3ed44ba864670882 + .quad 0x3feffffeafb377d5, 0x3ed335a62115bce2 + .quad 0x3feffffec2641a9e, 0x3ed22df298214423 + .quad 0x3feffffed413e5b7, 0x3ed133d96ae7e0dd + .quad 0x3feffffee4d01cd6, 0x3ed046aeabcfcdec + .quad 0x3feffffef4a55bd4, 0x3ececb9cfe1d8642 + .quad 0x3fefffff039f9e8f, 0x3ecd21397ead99cb + .quad 0x3fefffff11ca4876, 0x3ecb8d094c86d374 + .quad 0x3fefffff1f302bc1, 0x3eca0df0f0c626dc + .quad 0x3fefffff2bdb904d, 0x3ec8a2e269750a39 + .quad 0x3fefffff37d63a36, 0x3ec74adc8f4064d3 + .quad 0x3fefffff43297019, 0x3ec604ea819f007c + .quad 0x3fefffff4dde0118, 0x3ec4d0231928c6f9 + .quad 0x3fefffff57fc4a95, 0x3ec3aba85fe22e20 + .quad 0x3fefffff618c3da6, 0x3ec296a70f414053 + .quad 0x3fefffff6a956450, 0x3ec1905613b3abf2 + .quad 0x3fefffff731ee681, 0x3ec097f6156f32c5 + .quad 0x3fefffff7b2f8ed6, 0x3ebf59a20caf6695 + .quad 0x3fefffff82cdcf1b, 0x3ebd9c73698fb1dc + .quad 0x3fefffff89ffc4aa, 0x3ebbf716c6168bae + .quad 0x3fefffff90cb3c81, 0x3eba6852c6b58392 + .quad 0x3fefffff9735b73b, 0x3eb8eefd70594a89 + .quad 0x3fefffff9d446ccc, 0x3eb789fb715aae95 + .quad 0x3fefffffa2fc5015, 0x3eb6383f726a8e04 + .quad 0x3fefffffa8621251, 0x3eb4f8c96f26a26a + 
.quad 0x3fefffffad7a2652, 0x3eb3caa61607f920 + .quad 0x3fefffffb248c39d, 0x3eb2acee2f5ecdb8 + .quad 0x3fefffffb6d1e95d, 0x3eb19ec60b1242ed + .quad 0x3fefffffbb196132, 0x3eb09f5cf4dd2877 + .quad 0x3fefffffbf22c1e2, 0x3eaf5bd95d8730d8 + .quad 0x3fefffffc2f171e3, 0x3ead9371e2ff7c35 + .quad 0x3fefffffc688a9cf, 0x3eabe41de54d155a + .quad 0x3fefffffc9eb76ac, 0x3eaa4c89e08ef4f3 + .quad 0x3fefffffcd1cbc28, 0x3ea8cb738399b12c + .quad 0x3fefffffd01f36af, 0x3ea75fa8dbc84bec + .quad 0x3fefffffd2f57d68, 0x3ea608078a70dcbc + .quad 0x3fefffffd5a2041f, 0x3ea4c37c0394d094 + .quad 0x3fefffffd8271d12, 0x3ea39100d5687bfe + .quad 0x3fefffffda86faa9, 0x3ea26f9df8519bd7 + .quad 0x3fefffffdcc3b117, 0x3ea15e6827001f18 + .quad 0x3fefffffdedf37ed, 0x3ea05c803e4831c1 + .quad 0x3fefffffe0db6b91, 0x3e9ed22548cffd35 + .quad 0x3fefffffe2ba0ea5, 0x3e9d06ad6ecdf971 + .quad 0x3fefffffe47ccb60, 0x3e9b551c847fbc96 + .quad 0x3fefffffe62534d4, 0x3e99bc09f112b494 + .quad 0x3fefffffe7b4c81e, 0x3e983a1ff0aa239d + .quad 0x3fefffffe92ced93, 0x3e96ce1aa3fd7bdd + .quad 0x3fefffffea8ef9cf, 0x3e9576c72b514859 + .quad 0x3fefffffebdc2ec6, 0x3e943302cc4a0da8 + .quad 0x3fefffffed15bcba, 0x3e9301ba221dc9bb + .quad 0x3fefffffee3cc32c, 0x3e91e1e857adc568 + .quad 0x3fefffffef5251c2, 0x3e90d2966b1746f7 + .quad 0x3feffffff0576917, 0x3e8fa5b4f49cc6b2 + .quad 0x3feffffff14cfb92, 0x3e8dc3ae30b55c16 + .quad 0x3feffffff233ee1d, 0x3e8bfd7555a3bd68 + .quad 0x3feffffff30d18e8, 0x3e8a517d9e61628a + .quad 0x3feffffff3d9480f, 0x3e88be4f8f6c951f + .quad 0x3feffffff4993c46, 0x3e874287ded49339 + .quad 0x3feffffff54dab72, 0x3e85dcd669f2cd34 + .quad 0x3feffffff5f74141, 0x3e848bfd38302871 + .quad 0x3feffffff6969fb8, 0x3e834ecf8a3c124a + .quad 0x3feffffff72c5fb6, 0x3e822430f521cbcf + .quad 0x3feffffff7b91176, 0x3e810b1488aeb235 + .quad 0x3feffffff83d3d07, 0x3e80027c00a263a6 + .quad 0x3feffffff8b962be, 0x3e7e12ee004efc37 + .quad 0x3feffffff92dfba2, 0x3e7c3e44ae32b16b + .quad 0x3feffffff99b79d2, 0x3e7a854ea14102a8 + .quad 0x3feffffffa0248e8, 
0x3e78e6761569f45d + .quad 0x3feffffffa62ce54, 0x3e77603bac345f65 + .quad 0x3feffffffabd69b4, 0x3e75f1353cdad001 + .quad 0x3feffffffb127525, 0x3e74980cb3c80949 + .quad 0x3feffffffb624592, 0x3e73537f00b6ad4d + .quad 0x3feffffffbad2aff, 0x3e72225b12bffc68 + .quad 0x3feffffffbf370cd, 0x3e710380e1adb7e9 + .quad 0x3feffffffc355dfd, 0x3e6febc107d5efaa + .quad 0x3feffffffc733572, 0x3e6df0f2a0ee6947 + .quad 0x3feffffffcad3626, 0x3e6c14b2188bcee4 + .quad 0x3feffffffce39b67, 0x3e6a553644f7f07d + .quad 0x3feffffffd169d0c, 0x3e68b0cfce0579e0 + .quad 0x3feffffffd466fa5, 0x3e6725e7c5dd20f7 + .quad 0x3feffffffd7344aa, 0x3e65b2fe547a1340 + .quad 0x3feffffffd9d4aab, 0x3e6456a974e92e93 + .quad 0x3feffffffdc4ad7a, 0x3e630f93c3699078 + .quad 0x3feffffffde9964e, 0x3e61dc7b5b978cf8 + .quad 0x3feffffffe0c2bf0, 0x3e60bc30c5d52f15 + .quad 0x3feffffffe2c92db, 0x3e5f5b2be65a0c7f + .quad 0x3feffffffe4aed5e, 0x3e5d5f3a8dea7357 + .quad 0x3feffffffe675bbd, 0x3e5b82915b03515b + .quad 0x3feffffffe81fc4e, 0x3e59c3517e789488 + .quad 0x3feffffffe9aeb97, 0x3e581fb7df06136e + .quad 0x3feffffffeb24467, 0x3e56961b8d641d06 + .quad 0x3feffffffec81ff2, 0x3e5524ec4d916cae + .quad 0x3feffffffedc95e7, 0x3e53cab1343d18d1 + .quad 0x3feffffffeefbc85, 0x3e52860757487a01 + .quad 0x3fefffffff01a8b6, 0x3e5155a09065d4f7 + .quad 0x3fefffffff126e1e, 0x3e50384250e4c9fc + .quad 0x3fefffffff221f30, 0x3e4e59890b926c78 + .quad 0x3fefffffff30cd3f, 0x3e4c642116a8a9e3 + .quad 0x3fefffffff3e8892, 0x3e4a8e405e651ab6 + .quad 0x3fefffffff4b606f, 0x3e48d5f98114f872 + .quad 0x3fefffffff57632d, 0x3e47397c5a66e307 + .quad 0x3fefffffff629e44, 0x3e45b71456c5a4c4 + .quad 0x3fefffffff6d1e56, 0x3e444d26de513197 + .quad 0x3fefffffff76ef3f, 0x3e42fa31d6371537 + .quad 0x3fefffffff801c1f, 0x3e41bcca373b7b43 + .quad 0x3fefffffff88af67, 0x3e40939ab853339f + .quad 0x3fefffffff90b2e3, 0x3e3efac5187b2863 + .quad 0x3fefffffff982fc1, 0x3e3cf1e86235d0e7 + .quad 0x3fefffffff9f2e9f, 0x3e3b0a68a2128bab + .quad 0x3fefffffffa5b790, 0x3e39423165bc4444 + 
.quad 0x3fefffffffabd229, 0x3e37974e743dea3d + .quad 0x3fefffffffb18582, 0x3e3607e9eacd1050 + .quad 0x3fefffffffb6d844, 0x3e34924a74dec729 + .quad 0x3fefffffffbbd0aa, 0x3e3334d19e0c2160 + .quad 0x3fefffffffc0748f, 0x3e31edfa3c5f5cca + .quad 0x3fefffffffc4c96c, 0x3e30bc56f1b54701 + .quad 0x3fefffffffc8d462, 0x3e2f3d2185e047d9 + .quad 0x3fefffffffcc9a41, 0x3e2d26cb87945e87 + .quad 0x3fefffffffd01f89, 0x3e2b334fac4b9f99 + .quad 0x3fefffffffd36871, 0x3e296076f7918d1c + .quad 0x3fefffffffd678ed, 0x3e27ac2d72fc2c63 + .quad 0x3fefffffffd954ae, 0x3e2614801550319e + .quad 0x3fefffffffdbff2a, 0x3e24979ac8b28927 + .quad 0x3fefffffffde7ba0, 0x3e2333c68e2d0548 + .quad 0x3fefffffffe0cd16, 0x3e21e767bce37dd7 + .quad 0x3fefffffffe2f664, 0x3e20b0fc5b6d05a0 + .quad 0x3fefffffffe4fa30, 0x3e1f1e3523b41d7d + .quad 0x3fefffffffe6daf7, 0x3e1d00de6608effe + .quad 0x3fefffffffe89b0c, 0x3e1b0778b7b3301b + .quad 0x3fefffffffea3c9a, 0x3e192fb04ec0f6cf + .quad 0x3fefffffffebc1a9, 0x3e177756ec9f78fa + .quad 0x3fefffffffed2c21, 0x3e15dc61922d5a06 + .quad 0x3fefffffffee7dc8, 0x3e145ce65699ff6d + .quad 0x3fefffffffefb847, 0x3e12f71a5f159970 + .quad 0x3feffffffff0dd2b, 0x3e11a94ff571654f + .quad 0x3feffffffff1ede9, 0x3e1071f4bbea09ec + .quad 0x3feffffffff2ebda, 0x3e0e9f1ff8ddd774 + .quad 0x3feffffffff3d843, 0x3e0c818223a202c7 + .quad 0x3feffffffff4b453, 0x3e0a887bd2b4404d + .quad 0x3feffffffff58126, 0x3e08b1a336c5eb6b + .quad 0x3feffffffff63fc3, 0x3e06fab63324088a + .quad 0x3feffffffff6f121, 0x3e056197e30205ba + .quad 0x3feffffffff79626, 0x3e03e44e45301b92 + .quad 0x3feffffffff82fab, 0x3e0281000bfe4c3f + .quad 0x3feffffffff8be77, 0x3e0135f28f2d50b4 + .quad 0x3feffffffff94346, 0x3e000187dded5975 + .quad 0x3feffffffff9bec8, 0x3dfdc479de0ef001 + .quad 0x3feffffffffa319f, 0x3dfbad4fdad3caa1 + .quad 0x3feffffffffa9c63, 0x3df9baed3ed27ab8 + .quad 0x3feffffffffaffa4, 0x3df7ead9ce4285bb + .quad 0x3feffffffffb5be5, 0x3df63ac6b4edc88e + .quad 0x3feffffffffbb1a2, 0x3df4a88be2a6390c + .quad 0x3feffffffffc014e, 
0x3df332259185f1a0 + .quad 0x3feffffffffc4b56, 0x3df1d5b1f3793044 + .quad 0x3feffffffffc901c, 0x3df0916f04b6e18b + .quad 0x3feffffffffccfff, 0x3deec77101de6926 + .quad 0x3feffffffffd0b56, 0x3dec960bf23153e0 + .quad 0x3feffffffffd4271, 0x3dea8bd20fc65ef7 + .quad 0x3feffffffffd759d, 0x3de8a61745ec7d1d + .quad 0x3feffffffffda520, 0x3de6e25d0e756261 + .quad 0x3feffffffffdd13c, 0x3de53e4f7d1666cb + .quad 0x3feffffffffdfa2d, 0x3de3b7c27a7ddb0e + .quad 0x3feffffffffe202d, 0x3de24caf2c32af14 + .quad 0x3feffffffffe4371, 0x3de0fb3186804d0f + .quad 0x3feffffffffe642a, 0x3ddf830c0bb41fd7 + .quad 0x3feffffffffe8286, 0x3ddd3c0f1a91c846 + .quad 0x3feffffffffe9eb0, 0x3ddb1e5acf351d87 + .quad 0x3feffffffffeb8d0, 0x3dd92712d259ce66 + .quad 0x3feffffffffed10a, 0x3dd7538c60a04476 + .quad 0x3feffffffffee782, 0x3dd5a14b04b47879 + .quad 0x3feffffffffefc57, 0x3dd40dfd87456f4c + .quad 0x3fefffffffff0fa7, 0x3dd2977b1172b9d5 + .quad 0x3fefffffffff218f, 0x3dd13bc07e891491 + .quad 0x3fefffffffff3227, 0x3dcff1dbb4300811 + .quad 0x3fefffffffff4188, 0x3dcd9a880f306bd8 + .quad 0x3fefffffffff4fc9, 0x3dcb6e45220b55e0 + .quad 0x3fefffffffff5cfd, 0x3dc96a0b33f2c4da + .quad 0x3fefffffffff6939, 0x3dc78b07e9e924ac + .quad 0x3fefffffffff748e, 0x3dc5ce9ab1670dd2 + .quad 0x3fefffffffff7f0d, 0x3dc4325167006bb0 + .quad 0x3fefffffffff88c5, 0x3dc2b3e53538ff3f + .quad 0x3fefffffffff91c6, 0x3dc15137a7f44864 + .quad 0x3fefffffffff9a1b, 0x3dc0084ff125639d + .quad 0x3fefffffffffa1d2, 0x3dbdaeb0b7311ec7 + .quad 0x3fefffffffffa8f6, 0x3dbb7937d1c40c53 + .quad 0x3fefffffffffaf92, 0x3db96d082f59ab06 + .quad 0x3fefffffffffb5b0, 0x3db7872d9fa10aad + .quad 0x3fefffffffffbb58, 0x3db5c4e8e37bc7d0 + .quad 0x3fefffffffffc095, 0x3db423ac0df49a40 + .quad 0x3fefffffffffc56d, 0x3db2a117230ad284 + .quad 0x3fefffffffffc9e8, 0x3db13af4f04f9998 + .quad 0x3fefffffffffce0d, 0x3dafde703724e560 + .quad 0x3fefffffffffd1e1, 0x3dad77f0c82e7641 + .quad 0x3fefffffffffd56c, 0x3dab3ee02611d7dd + .quad 0x3fefffffffffd8b3, 0x3da92ff33023d5bd + 
.quad 0x3fefffffffffdbba, 0x3da7481a9e69f53f + .quad 0x3fefffffffffde86, 0x3da5847eda620959 + .quad 0x3fefffffffffe11d, 0x3da3e27c1fcc74bd + .quad 0x3fefffffffffe380, 0x3da25f9ee0b923dc + .quad 0x3fefffffffffe5b6, 0x3da0f9a068653200 + .quad 0x3fefffffffffe7c0, 0x3d9f5cc7718082b0 + .quad 0x3fefffffffffe9a2, 0x3d9cf7e53d6a2ca5 + .quad 0x3fefffffffffeb60, 0x3d9ac0f5f3229372 + .quad 0x3fefffffffffecfb, 0x3d98b498644847ea + .quad 0x3fefffffffffee77, 0x3d96cfa9bcca59dc + .quad 0x3fefffffffffefd6, 0x3d950f411d4fd2cd + .quad 0x3feffffffffff11a, 0x3d9370ab8327af5e + .quad 0x3feffffffffff245, 0x3d91f167f88c6b6e + .quad 0x3feffffffffff359, 0x3d908f24085d4597 + .quad 0x3feffffffffff457, 0x3d8e8f70e181d61a + .quad 0x3feffffffffff542, 0x3d8c324c20e337dc + .quad 0x3feffffffffff61b, 0x3d8a03261574b54e + .quad 0x3feffffffffff6e3, 0x3d87fe903cdf5855 + .quad 0x3feffffffffff79b, 0x3d86215c58da3450 + .quad 0x3feffffffffff845, 0x3d846897d4b69fc6 + .quad 0x3feffffffffff8e2, 0x3d82d1877d731b7b + .quad 0x3feffffffffff973, 0x3d8159a386b11517 + .quad 0x3feffffffffff9f8, 0x3d7ffd27ae9393ce + .quad 0x3feffffffffffa73, 0x3d7d7c593130dd0b + .quad 0x3feffffffffffae4, 0x3d7b2cd607c79bcf + .quad 0x3feffffffffffb4c, 0x3d790ae4d3405651 + .quad 0x3feffffffffffbad, 0x3d771312dd1759e2 + .quad 0x3feffffffffffc05, 0x3d75422ef5d8949d + .quad 0x3feffffffffffc57, 0x3d739544b0ecc957 + .quad 0x3feffffffffffca2, 0x3d720997f73e73dd + .quad 0x3feffffffffffce7, 0x3d709ca0eaacd277 + .quad 0x3feffffffffffd27, 0x3d6e9810295890ec + .quad 0x3feffffffffffd62, 0x3d6c2b45b5aa4a1d + .quad 0x3feffffffffffd98, 0x3d69eee068fa7596 + .quad 0x3feffffffffffdca, 0x3d67df2b399c10a8 + .quad 0x3feffffffffffdf8, 0x3d65f8b87a31bd85 + .quad 0x3feffffffffffe22, 0x3d64385c96e9a2d9 + .quad 0x3feffffffffffe49, 0x3d629b2933ef4cbc + .quad 0x3feffffffffffe6c, 0x3d611e68a6378f8a + .quad 0x3feffffffffffe8d, 0x3d5f7f338086a86b + .quad 0x3feffffffffffeab, 0x3d5cf8d7d9ce040a + .quad 0x3feffffffffffec7, 0x3d5aa577251ae485 + .quad 0x3feffffffffffee1, 
0x3d58811d739efb5f + .quad 0x3feffffffffffef8, 0x3d568823e52970be + .quad 0x3fefffffffffff0e, 0x3d54b72ae68e8b4c + .quad 0x3fefffffffffff22, 0x3d530b14dbe876bc + .quad 0x3fefffffffffff34, 0x3d5181012ef86610 + .quad 0x3fefffffffffff45, 0x3d501647ba798745 + .quad 0x3fefffffffffff54, 0x3d4d90e917701675 + .quad 0x3fefffffffffff62, 0x3d4b2a87e86d0c8a + .quad 0x3fefffffffffff6f, 0x3d48f53dcb377293 + .quad 0x3fefffffffffff7b, 0x3d46ed2f2515e933 + .quad 0x3fefffffffffff86, 0x3d450ecc9ed47f19 + .quad 0x3fefffffffffff90, 0x3d4356cd5ce7799e + .quad 0x3fefffffffffff9a, 0x3d41c229a587ab78 + .quad 0x3fefffffffffffa2, 0x3d404e15ecc7f3f6 + .quad 0x3fefffffffffffaa, 0x3d3deffc7e6a6017 + .quad 0x3fefffffffffffb1, 0x3d3b7b040832f310 + .quad 0x3fefffffffffffb8, 0x3d3938e021f36d76 + .quad 0x3fefffffffffffbe, 0x3d37258610b3b233 + .quad 0x3fefffffffffffc3, 0x3d353d3bfc82a909 + .quad 0x3fefffffffffffc8, 0x3d337c92babdc2fd + .quad 0x3fefffffffffffcd, 0x3d31e06010120f6a + .quad 0x3fefffffffffffd1, 0x3d3065b9616170d4 + .quad 0x3fefffffffffffd5, 0x3d2e13dd96b3753b + .quad 0x3fefffffffffffd9, 0x3d2b950d32467392 + .quad 0x3fefffffffffffdc, 0x3d294a72263259a5 + .quad 0x3fefffffffffffdf, 0x3d272fd93e036cdc + .quad 0x3fefffffffffffe2, 0x3d254164576929ab + .quad 0x3fefffffffffffe4, 0x3d237b83c521fe96 + .quad 0x3fefffffffffffe7, 0x3d21daf033182e96 + .quad 0x3fefffffffffffe9, 0x3d205ca50205d26a + .quad 0x3fefffffffffffeb, 0x3d1dfbb6235639fa + .quad 0x3fefffffffffffed, 0x3d1b7807e294781f + .quad 0x3fefffffffffffee, 0x3d19298add70a734 + .quad 0x3feffffffffffff0, 0x3d170beaf9c7ffb6 + .quad 0x3feffffffffffff1, 0x3d151b2cd6709222 + .quad 0x3feffffffffffff3, 0x3d1353a6cf7f7fff + .quad 0x3feffffffffffff4, 0x3d11b1fa8cbe84a7 + .quad 0x3feffffffffffff5, 0x3d10330f0fd69921 + .quad 0x3feffffffffffff6, 0x3d0da81670f96f9b + .quad 0x3feffffffffffff7, 0x3d0b24a16b4d09aa + .quad 0x3feffffffffffff7, 0x3d08d6eeb6efdbd6 + .quad 0x3feffffffffffff8, 0x3d06ba91ac734786 + .quad 0x3feffffffffffff9, 0x3d04cb7966770ab5 + 
.quad 0x3feffffffffffff9, 0x3d0305e9721d0981 + .quad 0x3feffffffffffffa, 0x3d01667311fff70a + .quad 0x3feffffffffffffb, 0x3cffd3de10d62855 + .quad 0x3feffffffffffffb, 0x3cfd1aefbcd48d0c + .quad 0x3feffffffffffffb, 0x3cfa9cc93c25aca9 + .quad 0x3feffffffffffffc, 0x3cf85487ee3ea735 + .quad 0x3feffffffffffffc, 0x3cf63daf8b4b1e0c + .quad 0x3feffffffffffffd, 0x3cf45421e69a6ca1 + .quad 0x3feffffffffffffd, 0x3cf294175802d99a + .quad 0x3feffffffffffffd, 0x3cf0fa17bf41068f + .quad 0x3feffffffffffffd, 0x3cef05e82aae2bb9 + .quad 0x3feffffffffffffe, 0x3cec578101b29058 + .quad 0x3feffffffffffffe, 0x3ce9e39dc5dd2f7c + .quad 0x3feffffffffffffe, 0x3ce7a553a728bbf2 + .quad 0x3feffffffffffffe, 0x3ce5982008db1304 + .quad 0x3feffffffffffffe, 0x3ce3b7e00422e51b + .quad 0x3feffffffffffffe, 0x3ce200c898d9ee3e + .quad 0x3fefffffffffffff, 0x3ce06f5f7eb65a56 + .quad 0x3fefffffffffffff, 0x3cde00e9148a1d25 + .quad 0x3fefffffffffffff, 0x3cdb623734024e92 + .quad 0x3fefffffffffffff, 0x3cd8fd4e01891bf8 + .quad 0x3fefffffffffffff, 0x3cd6cd44c7470d89 + .quad 0x3fefffffffffffff, 0x3cd4cd9c04158cd7 + .quad 0x3fefffffffffffff, 0x3cd2fa34bf5c8344 + .quad 0x3fefffffffffffff, 0x3cd14f4890ff2461 + .quad 0x3fefffffffffffff, 0x3ccf92c49dfa4df5 + .quad 0x3fefffffffffffff, 0x3ccccaaea71ab0df + .quad 0x3fefffffffffffff, 0x3cca40829f001197 + .quad 0x3ff0000000000000, 0x3cc7eef13b59e96c + .quad 0x3ff0000000000000, 0x3cc5d11e1a252bf5 + .quad 0x3ff0000000000000, 0x3cc3e296303b2297 + .quad 0x3ff0000000000000, 0x3cc21f47009f43ce + .quad 0x3ff0000000000000, 0x3cc083768c5e4542 + .quad 0x3ff0000000000000, 0x3cbe1777d831265f + .quad 0x3ff0000000000000, 0x3cbb69f10b0191b5 + .quad 0x3ff0000000000000, 0x3cb8f8a3a05b5b53 + .quad 0x3ff0000000000000, 0x3cb6be573c40c8e7 + .quad 0x3ff0000000000000, 0x3cb4b645ba991fdb + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* _AbsMask */ + .align 64 + .quad 
0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000, 0x4017f80000000000 /* _MaxThreshold = 6.0 - 1.0/128.0 */ + .align 64 + .quad 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000, 0x42c0000000000000 /* SRound */ + .align 64 + .quad 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000, 0x2ff0000000000000 /* _U2THreshold */ + .align 64 + .quad 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5, 0xbfa6c16db05bdea5 /* _poly_1_0 */ + .align 64 + .quad 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1, 0x3fc1111235a363b1 /* _poly_1_1 */ + .align 64 + .quad 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57, 0x3fcc71ca1c71eb57 /* _poly_3_0 */ + .align 64 + .quad 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8, 0xbfd9999c2be2dda8 /* _poly_3_1 */ + .align 64 + .quad 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F, 0xbfc5555800001B4F /* _poly_5_0 */ + .align 64 + .quad 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122, 0x3fb9999E2BE2F122 /* _poly_5_1 */ + .align 64 + .quad 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6, 0xbfd55555555547f6 /* _poly_1_2 */ + .align 64 + .quad 0x3fdfffffffffd4cd, 
0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd, 0x3fdfffffffffd4cd /* _poly_3_2 */ + .align 64 + .quad 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c, 0x3fe5555555554b0c /* _poly_1_3 */ + .align 64 + .quad 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555, 0xbfd5555555555555 /* _poly_3_3 */ + .align 64 + .quad 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff, 0x00000000ffffffff /* _Mask32 */ + .align 64 + .type __svml_derf_data_internal,@object + .size __svml_derf_data_internal,.-__svml_derf_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core-avx2.S new file mode 100644 index 0000000000..852a247f83 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized erff. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/
+
+#define _ZGVeN16v_erff _ZGVeN16v_erff_avx2_wrapper
+#include "../svml_s_erff16_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core.c
new file mode 100644
index 0000000000..5714eaf023
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core.c
@@ -0,0 +1,28 @@
+/* Multiple versions of vectorized erff, vector length is 16.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVeN16v_erff
+#include "ifunc-mathvec-avx512-skx.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVeN16v_erff, __GI__ZGVeN16v_erff,
+	       __redirect__ZGVeN16v_erff)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core_avx512.S
new file mode 100644
index 0000000000..5cdc8a77f7
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff16_core_avx512.S
@@ -0,0 +1,185 @@
+/* Function erff vectorized with AVX-512.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   https://www.gnu.org/licenses/.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *   erf(x) is computed as higher precision simple polynomial
+ *   with no lookup table:
+ *
+ *     R = P0 + x^2*(P1 + x^2*(P2 + .... x^2*P12));
+ *     erf(x) = R * R * x;
+ *
+ *   Special cases:
+ *
+ *   erf(0)    = 0
+ *   erf(+INF) = +1
+ *   erf(-INF) = -1
+ *   erf(QNaN) = QNaN
+ *   erf(SNaN) = QNaN
+ *
+ */
+
+/* Offsets for data table __svml_serf_data_internal
+ */
+#define _AbsMask 0
+#define _One 64
+#define _gf_MaxThreshold_LA 128
+#define _gf_la_poly_0 192
+#define _gf_la_poly_1 256
+#define _gf_la_poly_2 320
+#define _gf_la_poly_3 384
+#define _gf_la_poly_4 448
+#define _gf_la_poly_5 512
+#define _gf_la_poly_6 576
+#define _gf_la_poly_7 640
+#define _gf_la_poly_8 704
+#define _gf_la_poly_9 768
+#define _gf_la_poly_10 832
+#define _gf_la_poly_11 896
+#define _gf_la_poly_12 960
+
+#include <sysdep.h>
+
+        .text
+        .section .text.evex512,"ax",@progbits
+ENTRY(_ZGVeN16v_erff_skx)
+        vmovaps %zmm0, %zmm8
+        vmulps {rn-sae}, %zmm8, %zmm8, %zmm11
+        vmovups _gf_la_poly_11+__svml_serf_data_internal(%rip), %zmm15
+        vmovups _gf_la_poly_12+__svml_serf_data_internal(%rip), %zmm10
+        vmovups _gf_la_poly_10+__svml_serf_data_internal(%rip), %zmm9
+        vmovups _gf_la_poly_9+__svml_serf_data_internal(%rip), %zmm7
+        vmovups _gf_la_poly_8+__svml_serf_data_internal(%rip), %zmm0
+        vmovups _gf_la_poly_7+__svml_serf_data_internal(%rip), %zmm1
+        vmovups _gf_la_poly_6+__svml_serf_data_internal(%rip), %zmm2
+        vmovups _gf_la_poly_5+__svml_serf_data_internal(%rip), %zmm3
+        vmovups _gf_la_poly_4+__svml_serf_data_internal(%rip), %zmm4
+        vmovups _gf_la_poly_3+__svml_serf_data_internal(%rip), %zmm5
+        vmovups _gf_la_poly_2+__svml_serf_data_internal(%rip), %zmm6
+        vextractf32x8 $1, %zmm8, %ymm13
+        vcvtps2pd {sae}, %ymm8, %zmm12
+        vcvtps2pd {sae}, %ymm13, %zmm14
+        vmulpd {rn-sae}, %zmm12, %zmm12, %zmm12
+        vmulpd {rn-sae}, %zmm14, %zmm14, %zmm13
+
+/* R = P0 + x^2*(P1 + x^2*(P2 + .... x^2*P12)); */
+        vmovaps %zmm15, %zmm14
+        vfmadd231pd {rn-sae}, %zmm12, %zmm10, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm10, %zmm15
+        vmovups _gf_la_poly_1+__svml_serf_data_internal(%rip), %zmm10
+        vfmadd213pd {rn-sae}, %zmm9, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm15, %zmm9
+        vfmadd213pd {rn-sae}, %zmm7, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm9, %zmm7
+        vfmadd213pd {rn-sae}, %zmm0, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm7, %zmm0
+        vmovups _gf_MaxThreshold_LA+__svml_serf_data_internal(%rip), %zmm7
+        vfmadd213pd {rn-sae}, %zmm1, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm0, %zmm1
+        vmovups _gf_la_poly_0+__svml_serf_data_internal(%rip), %zmm0
+        vcmpps $22, {sae}, %zmm11, %zmm7, %k1
+        vfmadd213pd {rn-sae}, %zmm2, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm1, %zmm2
+        vfmadd213pd {rn-sae}, %zmm3, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm2, %zmm3
+        vfmadd213pd {rn-sae}, %zmm4, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm3, %zmm4
+        vfmadd213pd {rn-sae}, %zmm5, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm4, %zmm5
+        vfmadd213pd {rn-sae}, %zmm6, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm5, %zmm6
+        vmovups _AbsMask+__svml_serf_data_internal(%rip), %zmm5
+        vfmadd213pd {rn-sae}, %zmm10, %zmm12, %zmm14
+        vfmadd231pd {rn-sae}, %zmm13, %zmm6, %zmm10
+        vandnps %zmm8, %zmm5, %zmm6
+        vfmadd213pd {rn-sae}, %zmm0,
%zmm14, %zmm12 + vfmadd213pd {rn-sae}, %zmm0, %zmm10, %zmm13 + vorps _One+__svml_serf_data_internal(%rip), %zmm6, %zmm0 + vmulpd {rn-sae}, %zmm12, %zmm12, %zmm1 + vmulpd {rn-sae}, %zmm13, %zmm13, %zmm3 + vcvtpd2ps {rn-sae}, %zmm1, %ymm2 + vcvtpd2ps {rn-sae}, %zmm3, %ymm4 + vinsertf32x8 $1, %ymm4, %zmm2, %zmm9 + +/* erf(x) = R * R * x; */ + vmulps {rn-sae}, %zmm8, %zmm9, %zmm0{%k1} + ret + +END(_ZGVeN16v_erff_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_serf_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _AbsMask[16][1]; + __declspec(align(64)) VUINT32 _One[16][1]; + __declspec(align(64)) VUINT32 _gf_MaxThreshold_LA[16][1]; + __declspec(align(64)) VUINT32 _gf_la_poly_0[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_1[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_2[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_3[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_4[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_5[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_6[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_7[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_8[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_9[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_10[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_11[8][2]; + __declspec(align(64)) VUINT32 _gf_la_poly_12[8][2]; +} __svml_serf_data_internal; +#endif +__svml_serf_data_internal: + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _AbsMask */ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 /* _One */ + .align 64 + .long 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 
0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a, 0x41558c5a /* _gf_MaxThreshold_LA */ + .align 64 + .quad 0x3ff0fefbd933b903, 0x3ff0fefbd933b903, 0x3ff0fefbd933b903, 0x3ff0fefbd933b903, 0x3ff0fefbd933b903, 0x3ff0fefbd933b903, 0x3ff0fefbd933b903, 0x3ff0fefbd933b903 /* _gf_la_poly_0 */ + .align 64 + .quad 0xbfc6a948101e6367, 0xbfc6a948101e6367, 0xbfc6a948101e6367, 0xbfc6a948101e6367, 0xbfc6a948101e6367, 0xbfc6a948101e6367, 0xbfc6a948101e6367, 0xbfc6a948101e6367 /* _gf_la_poly_1 */ + .align 64 + .quad 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b, 0x3fa3a334ce602c6b /* _gf_la_poly_2 */ + .align 64 + .quad 0xbf799309ea0c81dc, 0xbf799309ea0c81dc, 0xbf799309ea0c81dc, 0xbf799309ea0c81dc, 0xbf799309ea0c81dc, 0xbf799309ea0c81dc, 0xbf799309ea0c81dc, 0xbf799309ea0c81dc /* _gf_la_poly_3 */ + .align 64 + .quad 0x3f476df64a40e392, 0x3f476df64a40e392, 0x3f476df64a40e392, 0x3f476df64a40e392, 0x3f476df64a40e392, 0x3f476df64a40e392, 0x3f476df64a40e392, 0x3f476df64a40e392 /* _gf_la_poly_4 */ + .align 64 + .quad 0xbf0a5216b9508ede, 0xbf0a5216b9508ede, 0xbf0a5216b9508ede, 0xbf0a5216b9508ede, 0xbf0a5216b9508ede, 0xbf0a5216b9508ede, 0xbf0a5216b9508ede, 0xbf0a5216b9508ede /* _gf_la_poly_5 */ + .align 64 + .quad 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0, 0x3ea5794b95c8e8a0 /* _gf_la_poly_6 */ + .align 64 + .quad 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f, 0x3e94b6c0b485f30f /* _gf_la_poly_7 */ + .align 64 + .quad 0xbe65806ce17f0523, 0xbe65806ce17f0523, 0xbe65806ce17f0523, 0xbe65806ce17f0523, 0xbe65806ce17f0523, 0xbe65806ce17f0523, 0xbe65806ce17f0523, 0xbe65806ce17f0523 /* _gf_la_poly_8 */ + .align 64 + .quad 0x3e2715640470db47, 0x3e2715640470db47, 
0x3e2715640470db47, 0x3e2715640470db47, 0x3e2715640470db47, 0x3e2715640470db47, 0x3e2715640470db47, 0x3e2715640470db47 /* _gf_la_poly_9 */ + .align 64 + .quad 0xbdddcb2653d80f03, 0xbdddcb2653d80f03, 0xbdddcb2653d80f03, 0xbdddcb2653d80f03, 0xbdddcb2653d80f03, 0xbdddcb2653d80f03, 0xbdddcb2653d80f03, 0xbdddcb2653d80f03 /* _gf_la_poly_10 */ + .align 64 + .quad 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb, 0x3d85eadfc762d3eb /* _gf_la_poly_11 */ + .align 64 + .quad 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1, 0xbd1c668a2871f0f1 /* _gf_la_poly_12 */ + .align 64 + .type __svml_serf_data_internal,@object + .size __svml_serf_data_internal,.-__svml_serf_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core-sse2.S new file mode 100644 index 0000000000..651fd267a5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized erff, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/
+
+#define _ZGVbN4v_erff _ZGVbN4v_erff_sse2
+#include "../svml_s_erff4_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core.c
new file mode 100644
index 0000000000..02286a68c6
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core.c
@@ -0,0 +1,28 @@
+/* Multiple versions of vectorized erff, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVbN4v_erff
+#include "ifunc-mathvec-sse4_1.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVbN4v_erff, __GI__ZGVbN4v_erff,
+	       __redirect__ZGVbN4v_erff)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core_sse4.S
new file mode 100644
index 0000000000..fb4b08f53b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff4_core_sse4.S
@@ -0,0 +1,661 @@
+/* Function erff vectorized with SSE4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
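The AVX-512 kernel earlier in this patch documents its scheme as R = P0 + x^2*(P1 + x^2*(P2 + ... x^2*P12)) with erf(x) = R * R * x, and stores P0..P12 as IEEE-754 double bit patterns in the _gf_la_poly_* rows of its data table. As a plausibility check, here is a scalar Python model of that evaluation; the coefficients are copied from the patch, while the tolerance is my own choice, not a bound stated in the patch:

```python
import math
import struct

# P0..P12 from the _gf_la_poly_* rows of the AVX-512 data table,
# as IEEE-754 double bit patterns copied from the patch.
POLY = [0x3ff0fefbd933b903, 0xbfc6a948101e6367, 0x3fa3a334ce602c6b,
        0xbf799309ea0c81dc, 0x3f476df64a40e392, 0xbf0a5216b9508ede,
        0x3ea5794b95c8e8a0, 0x3e94b6c0b485f30f, 0xbe65806ce17f0523,
        0x3e2715640470db47, 0xbdddcb2653d80f03, 0x3d85eadfc762d3eb,
        0xbd1c668a2871f0f1]

def bits_to_double(b):
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def erff_poly(x):
    # R = P0 + t*(P1 + t*(... + t*P12)) with t = x*x, evaluated by Horner,
    # then erf(x) ~ R * R * x, mirroring the kernel's FMA chain.
    t = x * x
    r = 0.0
    for c in reversed(POLY):
        r = r * t + bits_to_double(c)
    return r * r * x

# Only valid below the kernel's threshold compare (x*x <= _gf_MaxThreshold_LA,
# roughly |x| <= 3.65); beyond that the kernel stores sign(x) * 1.0 instead.
for x in (0.25, 1.0, 2.0, 3.0):
    assert abs(erff_poly(x) - math.erf(x)) < 1e-5
```

Note how the masked vmulps at the end of the kernel writes the polynomial result only under %k1 (where x*x is below the threshold), so the prepared sign-or-1.0 value survives elsewhere.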
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   https://www.gnu.org/licenses/.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *   Basic formula is
+ *     erf(x) ~ erf(x0)
+ *       + exp(-x0*x0)*D*(1+c0+T*P1(T)+D^2*P3(T)+D^4*p5)
+ *     where D=x-x0, T=x0*D
+ *     x0 is x rounded to a specified number of fractional bits (in this case 8),
+ *     except that x0=0 for |x|<3.5/256.0 (using x0=0 for first 4 table entries)
+ *
+ *   Data table packs both erf(x0)_high and a few bits of erf(x0)_low in one
+ *   entry (in place of redundant exponent bits)
+ *
+ */
+
+/* Offsets for data table __svml_serf_data_internal
+ */
+#define _erf_tbl 0
+#define _AbsMask 4032
+#define _MaxThreshold 4048
+#define _SRound 4064
+#define _U2Threshold 4080
+#define _poly3_0 4096
+
+#include <sysdep.h>
+
+        .text
+        .section .text.sse4,"ax",@progbits
+ENTRY(_ZGVbN4v_erff_sse4)
+        lea -1006632960+__svml_serf_data_internal(%rip), %rdi
+        movups _AbsMask+__svml_serf_data_internal(%rip), %xmm9
+        andps %xmm0, %xmm9
+
+/*
+ * erf(x) rounds to 1.0 for x>_MaxThreshold (3.9375)
+ * can compute all results in the main path
+ */
+        movaps %xmm9, %xmm12
+
+/* save sign */
+        pxor %xmm9, %xmm0
+        minps _MaxThreshold+__svml_serf_data_internal(%rip), %xmm12
+
+/*
+ * vector gather:
+ * erf(x0), exp(-x0*x0)*2.0/sqrt(pi)
+ */
+        movups _SRound+__svml_serf_data_internal(%rip), %xmm1
+        movaps %xmm1, %xmm4
+        movups _U2Threshold+__svml_serf_data_internal(%rip), %xmm11
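Per the header comment above, the SSE4 path is table-based: each _erf_tbl entry pairs erf(x0) with roughly 2/sqrt(pi)*exp(-x0*x0), one pair per step x0 = i/128. A small Python decode of the first few entries, with the bit patterns copied from the _erf_tbl data later in this patch; the tolerances are my own and are chosen loosely, since per the comment an entry may also pack a few low bits of erf(x0):

```python
import math
import struct

# First six _erf_tbl pairs from the SSE4 data table: (erf(x0), d(x0)) with
# d(x0) ~ 2/sqrt(pi)*exp(-x0*x0), one pair per step x0 = i/128, as IEEE-754
# single-precision bit patterns copied from the patch.
TBL = [(0x00000000, 0x3f906ebb), (0x3c106dfa, 0x3f906c79),
       (0x3c906bb8, 0x3f9065b4), (0x3cd89bf0, 0x3f905a6c),
       (0x3d1062b2, 0x3f904aa3), (0x3d3472ea, 0x3f90365a)]

def f32(bits):
    return struct.unpack('<f', struct.pack('<I', bits))[0]

for i, (erf_hi, d0) in enumerate(TBL):
    x0 = i / 128.0
    assert abs(f32(erf_hi) - math.erf(x0)) < 1e-5
    # The second word may carry extra low bits of erf(x0), so compare loosely.
    assert abs(f32(d0) - 2 / math.sqrt(math.pi) * math.exp(-x0 * x0)) < 1e-4
```

The kernel then reconstructs erf(x) from the gathered pair as erf(x0) plus a d(x0)*D correction term with D = x - x0, which is what the final mulps/addps sequence in this function computes.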
addps %xmm12, %xmm4 + cmpltps %xmm12, %xmm11 + movaps %xmm4, %xmm10 + pslld $3, %xmm4 + pshufd $1, %xmm4, %xmm2 + subps %xmm1, %xmm10 + movd %xmm4, %eax + movd %xmm2, %edx + pshufd $2, %xmm4, %xmm3 + subps %xmm10, %xmm12 + movd %xmm3, %ecx + andps %xmm12, %xmm11 + +/* D2 = Diff^2 */ + mulps %xmm11, %xmm11 + mulps %xmm12, %xmm10 + +/* NaN fixup */ + minps %xmm9, %xmm12 + +/* + * Start polynomial evaluation + * P1 + */ + mulps _poly3_0+__svml_serf_data_internal(%rip), %xmm11 + pshufd $3, %xmm4, %xmm5 + subps %xmm10, %xmm11 + movd %xmm5, %esi + +/* + * branch-free + * (exp_h(x0) * Diff) * (poly + 1.0) + */ + mulps %xmm12, %xmm11 + movslq %eax, %rax + addps %xmm11, %xmm12 + movslq %edx, %rdx + movslq %ecx, %rcx + movslq %esi, %rsi + movq (%rdi,%rax), %xmm13 + movq (%rdi,%rdx), %xmm6 + movq (%rdi,%rcx), %xmm8 + movq (%rdi,%rsi), %xmm7 + unpcklps %xmm6, %xmm13 + unpcklps %xmm7, %xmm8 + movaps %xmm13, %xmm14 + shufps $238, %xmm8, %xmm13 + +/* Final result */ + mulps %xmm12, %xmm13 + movlhps %xmm8, %xmm14 + addps %xmm13, %xmm14 + +/* set sign */ + orps %xmm14, %xmm0 + ret + +END(_ZGVbN4v_erff_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_serf_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _erf_tbl[1008][1]; + __declspec(align(16)) VUINT32 _AbsMask[4][1]; + __declspec(align(16)) VUINT32 _MaxThreshold[4][1]; + __declspec(align(16)) VUINT32 _SRound[4][1]; + __declspec(align(16)) VUINT32 _U2Threshold[4][1]; + __declspec(align(16)) VUINT32 _poly3_0[4][1]; +} __svml_serf_data_internal; +#endif +__svml_serf_data_internal: + /*== _erf_tbl ==*/ + .long 0x00000000, 0x3f906ebb + .long 0x3c106dfa, 0x3f906c79 + .long 0x3c906bb8, 0x3f9065b4 + .long 0x3cd89bf0, 0x3f905a6c + .long 0x3d1062b2, 0x3f904aa3 + .long 0x3d3472ea, 0x3f90365a + .long 0x3d587d7f, 0x3f901d93 + .long 0x3d7c8154, 0x3f900050 + .long 0x3d903ea4, 0x3f8fde94 + .long 0x3da2381f, 0x3f8fb862 + .long 0x3db42c8d, 0x3f8f8dbd + .long 0x3dc61b5f, 0x3f8f5eab + 
.long 0x3dd80409, 0x3f8f2b2e + .long 0x3de9e5fc, 0x3f8ef34c + .long 0x3dfbc0ad, 0x3f8eb70a + .long 0x3e06c9c8, 0x3f8e766e + .long 0x3e0faf0d, 0x3f8e317d + .long 0x3e188fe1, 0x3f8de83e + .long 0x3e216bfe, 0x3f8d9ab9 + .long 0x3e2a4321, 0x3f8d48f3 + .long 0x3e331506, 0x3f8cf2f5 + .long 0x3e3be169, 0x3f8c98c6 + .long 0x3e44a808, 0x3f8c3a6f + .long 0x3e4d68a1, 0x3f8bd7f8 + .long 0x3e5622f2, 0x3f8b716c + .long 0x3e5ed6b9, 0x3f8b06d2 + .long 0x3e6783b7, 0x3f8a9834 + .long 0x3e7029aa, 0x3f8a259e + .long 0x3e78c855, 0x3f89af18 + .long 0x3e80afbc, 0x3f8934af + .long 0x3e84f76b, 0x3f88b66c + .long 0x3e893b19, 0x3f88345d + .long 0x3e8d7aa7, 0x3f87ae8b + .long 0x3e91b5f8, 0x3f872504 + .long 0x3e95ecee, 0x3f8697d3 + .long 0x3e9a1f6b, 0x3f860705 + .long 0x3e9e4d54, 0x3f8572a8 + .long 0x3ea2768c, 0x3f84dac8 + .long 0x3ea69af8, 0x3f843f72 + .long 0x3eaaba7a, 0x3f83a0b6 + .long 0x3eaed4fa, 0x3f82fe9f + .long 0x3eb2ea5c, 0x3f82593e + .long 0x3eb6fa85, 0x3f81b0a0 + .long 0x3ebb055d, 0x3f8104d3 + .long 0x3ebf0aca, 0x3f8055e8 + .long 0x3ec30ab3, 0x3f7f47d8 + .long 0x3ec70501, 0x3f7ddddf + .long 0x3ecaf99b, 0x3f7c6e05 + .long 0x3ecee869, 0x3f7af867 + .long 0x3ed2d156, 0x3f797d26 + .long 0x3ed6b44b, 0x3f77fc62 + .long 0x3eda9132, 0x3f76763c + .long 0x3ede67f6, 0x3f74ead4 + .long 0x3ee23882, 0x3f735a4c + .long 0x3ee602c2, 0x3f71c4c4 + .long 0x3ee9c6a2, 0x3f702a5f + .long 0x3eed840e, 0x3f6e8b3e + .long 0x3ef13af5, 0x3f6ce783 + .long 0x3ef4eb45, 0x3f6b3f51 + .long 0x3ef894ea, 0x3f6992c9 + .long 0x3efc37d5, 0x3f67e20f + .long 0x3effd3f5, 0x3f662d45 + .long 0x3f01b49d, 0x3f64748e + .long 0x3f037bca, 0x3f62b80d + .long 0x3f053f7b, 0x3f60f7e5 + .long 0x3f06ffa8, 0x3f5f3439 + .long 0x3f08bc4a, 0x3f5d6d2d + .long 0x3f0a755a, 0x3f5ba2e3 + .long 0x3f0c2ad3, 0x3f59d57e + .long 0x3f0ddcae, 0x3f580523 + .long 0x3f0f8ae6, 0x3f5631f4 + .long 0x3f113574, 0x3f545c14 + .long 0x3f12dc54, 0x3f5283a7 + .long 0x3f147f81, 0x3f50a8cf + .long 0x3f161ef6, 0x3f4ecbb1 + .long 0x3f17baae, 0x3f4cec6d + .long 
0x3f1952a6, 0x3f4b0b28 + .long 0x3f1ae6da, 0x3f492804 + .long 0x3f1c7745, 0x3f474323 + .long 0x3f1e03e5, 0x3f455ca8 + .long 0x3f1f8cb7, 0x3f4374b5 + .long 0x3f2111b7, 0x3f418b6b + .long 0x3f2292e4, 0x3f3fa0ee + .long 0x3f24103a, 0x3f3db55e + .long 0x3f2589b9, 0x3f3bc8dc + .long 0x3f26ff5d, 0x3f39db8a + .long 0x3f287126, 0x3f37ed89 + .long 0x3f29df13, 0x3f35fef8 + .long 0x3f2b4922, 0x3f340ff9 + .long 0x3f2caf53, 0x3f3220ab + .long 0x3f2e11a4, 0x3f30312e + .long 0x3f2f7017, 0x3f2e41a1 + .long 0x3f30caab, 0x3f2c5223 + .long 0x3f322160, 0x3f2a62d3 + .long 0x3f337437, 0x3f2873cf + .long 0x3f34c32f, 0x3f268534 + .long 0x3f360e4c, 0x3f249721 + .long 0x3f37558c, 0x3f22a9b3 + .long 0x3f3898f3, 0x3f20bd06 + .long 0x3f39d881, 0x3f1ed137 + .long 0x3f3b1438, 0x3f1ce661 + .long 0x3f3c4c1b, 0x3f1afca0 + .long 0x3f3d802c, 0x3f19140f + .long 0x3f3eb06c, 0x3f172cc9 + .long 0x3f3fdce0, 0x3f1546e7 + .long 0x3f410589, 0x3f136284 + .long 0x3f422a6b, 0x3f117fb9 + .long 0x3f434b89, 0x3f0f9e9e + .long 0x3f4468e7, 0x3f0dbf4c + .long 0x3f458287, 0x3f0be1db + .long 0x3f46986f, 0x3f0a0662 + .long 0x3f47aaa2, 0x3f082cf7 + .long 0x3f48b925, 0x3f0655b1 + .long 0x3f49c3fb, 0x3f0480a6 + .long 0x3f4acb29, 0x3f02adeb + .long 0x3f4bceb4, 0x3f00dd96 + .long 0x3f4ccea1, 0x3efe1f73 + .long 0x3f4dcaf4, 0x3efa88d5 + .long 0x3f4ec3b4, 0x3ef6f777 + .long 0x3f4fb8e5, 0x3ef36b80 + .long 0x3f50aa8d, 0x3eefe513 + .long 0x3f5198b1, 0x3eec6455 + .long 0x3f528358, 0x3ee8e968 + .long 0x3f536a86, 0x3ee5746d + .long 0x3f544e43, 0x3ee20584 + .long 0x3f552e93, 0x3ede9ccc + .long 0x3f560b7e, 0x3edb3a64 + .long 0x3f56e50a, 0x3ed7de6a + .long 0x3f57bb3d, 0x3ed488f8 + .long 0x3f588e1e, 0x3ed13a2b + .long 0x3f595db4, 0x3ecdf21c + .long 0x3f5a2a05, 0x3ecab0e4 + .long 0x3f5af318, 0x3ec7769b + .long 0x3f5bb8f4, 0x3ec44359 + .long 0x3f5c7ba1, 0x3ec11733 + .long 0x3f5d3b25, 0x3ebdf23d + .long 0x3f5df788, 0x3ebad48d + .long 0x3f5eb0d1, 0x3eb7be35 + .long 0x3f5f6707, 0x3eb4af46 + .long 0x3f601a32, 0x3eb1a7d3 + .long 0x3f60ca59, 
0x3eaea7ea + .long 0x3f617784, 0x3eabaf9a + .long 0x3f6221bb, 0x3ea8bef3 + .long 0x3f62c905, 0x3ea5d600 + .long 0x3f636d69, 0x3ea2f4ce + .long 0x3f640ef1, 0x3ea01b68 + .long 0x3f64ada3, 0x3e9d49d9 + .long 0x3f654987, 0x3e9a8029 + .long 0x3f65e2a6, 0x3e97be62 + .long 0x3f667906, 0x3e95048b + .long 0x3f670cb1, 0x3e9252aa + .long 0x3f679dae, 0x3e8fa8c5 + .long 0x3f682c06, 0x3e8d06e3 + .long 0x3f68b7bf, 0x3e8a6d05 + .long 0x3f6940e2, 0x3e87db31 + .long 0x3f69c778, 0x3e855168 + .long 0x3f6a4b88, 0x3e82cfad + .long 0x3f6acd1a, 0x3e805600 + .long 0x3f6b4c36, 0x3e7bc8c2 + .long 0x3f6bc8e5, 0x3e76f5a0 + .long 0x3f6c432f, 0x3e723298 + .long 0x3f6cbb1b, 0x3e6d7fa5 + .long 0x3f6d30b1, 0x3e68dcc1 + .long 0x3f6da3fa, 0x3e6449e7 + .long 0x3f6e14fe, 0x3e5fc70e + .long 0x3f6e83c4, 0x3e5b542b + .long 0x3f6ef055, 0x3e56f136 + .long 0x3f6f5ab8, 0x3e529e21 + .long 0x3f6fc2f5, 0x3e4e5adf + .long 0x3f702915, 0x3e4a2761 + .long 0x3f708d1f, 0x3e460399 + .long 0x3f70ef1b, 0x3e41ef75 + .long 0x3f714f11, 0x3e3deae4 + .long 0x3f71ad09, 0x3e39f5d2 + .long 0x3f72090a, 0x3e36102b + .long 0x3f72631c, 0x3e3239db + .long 0x3f72bb46, 0x3e2e72cb + .long 0x3f731191, 0x3e2abae4 + .long 0x3f736604, 0x3e27120f + .long 0x3f73b8a5, 0x3e237833 + .long 0x3f74097e, 0x3e1fed36 + .long 0x3f745895, 0x3e1c70fd + .long 0x3f74a5f2, 0x3e19036e + .long 0x3f74f19b, 0x3e15a46d + .long 0x3f753b98, 0x3e1253dc + .long 0x3f7583f1, 0x3e0f119f + .long 0x3f75caac, 0x3e0bdd96 + .long 0x3f760fd1, 0x3e08b7a4 + .long 0x3f765366, 0x3e059fa9 + .long 0x3f769573, 0x3e029586 + .long 0x3f76d5fe, 0x3dff3230 + .long 0x3f77150f, 0x3df95481 + .long 0x3f7752ab, 0x3df391b9 + .long 0x3f778eda, 0x3dede995 + .long 0x3f77c9a2, 0x3de85bd0 + .long 0x3f78030a, 0x3de2e825 + .long 0x3f783b18, 0x3ddd8e4c + .long 0x3f7871d3, 0x3dd84dfe + .long 0x3f78a741, 0x3dd326f3 + .long 0x3f78db68, 0x3dce18e3 + .long 0x3f790e50, 0x3dc92385 + .long 0x3f793ffc, 0x3dc4468f + .long 0x3f797075, 0x3dbf81b6 + .long 0x3f799fbf, 0x3dbad4b0 + .long 0x3f79cde1, 0x3db63f32 + 
.long 0x3f79fae1, 0x3db1c0f1 + .long 0x3f7a26c4, 0x3dad59a1 + .long 0x3f7a518f, 0x3da908f6 + .long 0x3f7a7b4a, 0x3da4cea4 + .long 0x3f7aa3f9, 0x3da0aa5e + .long 0x3f7acba1, 0x3d9c9bd9 + .long 0x3f7af248, 0x3d98a2c7 + .long 0x3f7b17f4, 0x3d94bedd + .long 0x3f7b3ca9, 0x3d90efcd + .long 0x3f7b606e, 0x3d8d354b + .long 0x3f7b8346, 0x3d898f0a + .long 0x3f7ba537, 0x3d85fcbf + .long 0x3f7bc646, 0x3d827e1d + .long 0x3f7be677, 0x3d7e25af + .long 0x3f7c05d1, 0x3d777546 + .long 0x3f7c2456, 0x3d70ea68 + .long 0x3f7c420d, 0x3d6a847d + .long 0x3f7c5ef9, 0x3d6442f0 + .long 0x3f7c7b1f, 0x3d5e252a + .long 0x3f7c9684, 0x3d582a98 + .long 0x3f7cb12b, 0x3d5252a5 + .long 0x3f7ccb1a, 0x3d4c9cbd + .long 0x3f7ce454, 0x3d47084e + .long 0x3f7cfcdd, 0x3d4194c7 + .long 0x3f7d14ba, 0x3d3c4196 + .long 0x3f7d2bef, 0x3d370e2c + .long 0x3f7d427f, 0x3d31f9fb + .long 0x3f7d586f, 0x3d2d0474 + .long 0x3f7d6dc2, 0x3d282d0c + .long 0x3f7d827b, 0x3d237336 + .long 0x3f7d96a0, 0x3d1ed669 + .long 0x3f7daa32, 0x3d1a561b + .long 0x3f7dbd36, 0x3d15f1c6 + .long 0x3f7dcfb0, 0x3d11a8e1 + .long 0x3f7de1a2, 0x3d0d7ae9 + .long 0x3f7df30f, 0x3d09675a + .long 0x3f7e03fd, 0x3d056db0 + .long 0x3f7e146c, 0x3d018d6b + .long 0x3f7e2461, 0x3cfb8c15 + .long 0x3f7e33de, 0x3cf42e22 + .long 0x3f7e42e8, 0x3ced0003 + .long 0x3f7e517f, 0x3ce600c0 + .long 0x3f7e5fa9, 0x3cdf2f67 + .long 0x3f7e6d66, 0x3cd88b05 + .long 0x3f7e7abb, 0x3cd212ad + .long 0x3f7e87aa, 0x3ccbc574 + .long 0x3f7e9435, 0x3cc5a273 + .long 0x3f7ea05f, 0x3cbfa8c4 + .long 0x3f7eac2b, 0x3cb9d786 + .long 0x3f7eb79a, 0x3cb42ddb + .long 0x3f7ec2b1, 0x3caeaae6 + .long 0x3f7ecd71, 0x3ca94dcf + .long 0x3f7ed7dc, 0x3ca415c2 + .long 0x3f7ee1f4, 0x3c9f01ec + .long 0x3f7eebbd, 0x3c9a117f + .long 0x3f7ef537, 0x3c9543ae + .long 0x3f7efe66, 0x3c9097b1 + .long 0x3f7f074b, 0x3c8c0cc2 + .long 0x3f7f0fe8, 0x3c87a21f + .long 0x3f7f1840, 0x3c83570a + .long 0x3f7f2053, 0x3c7e558a + .long 0x3f7f2826, 0x3c763931 + .long 0x3f7f2fb8, 0x3c6e579b + .long 0x3f7f370c, 0x3c66af65 + .long 
0x3f7f3e23, 0x3c5f3f2d + .long 0x3f7f4500, 0x3c58059c + .long 0x3f7f4ba4, 0x3c51015f + .long 0x3f7f5211, 0x3c4a3127 + .long 0x3f7f5848, 0x3c4393af + .long 0x3f7f5e4b, 0x3c3d27b5 + .long 0x3f7f641b, 0x3c36ebff + .long 0x3f7f69ba, 0x3c30df57 + .long 0x3f7f6f29, 0x3c2b008e + .long 0x3f7f746a, 0x3c254e7b + .long 0x3f7f797f, 0x3c1fc7fb + .long 0x3f7f7e67, 0x3c1a6bee + .long 0x3f7f8326, 0x3c15393d + .long 0x3f7f87bb, 0x3c102ed6 + .long 0x3f7f8c29, 0x3c0b4bab + .long 0x3f7f9070, 0x3c068eb5 + .long 0x3f7f9492, 0x3c01f6f1 + .long 0x3f7f9890, 0x3bfb06c5 + .long 0x3f7f9c6b, 0x3bf26625 + .long 0x3f7fa024, 0x3bea0a1d + .long 0x3f7fa3bc, 0x3be1f0d3 + .long 0x3f7fa734, 0x3bda1876 + .long 0x3f7faa8d, 0x3bd27f42 + .long 0x3f7fadc8, 0x3bcb237a + .long 0x3f7fb0e6, 0x3bc4036c + .long 0x3f7fb3e8, 0x3bbd1d6f + .long 0x3f7fb6cf, 0x3bb66fe6 + .long 0x3f7fb99c, 0x3baff93b + .long 0x3f7fbc4f, 0x3ba9b7e1 + .long 0x3f7fbeea, 0x3ba3aa56 + .long 0x3f7fc16d, 0x3b9dcf20 + .long 0x3f7fc3d9, 0x3b9824ce + .long 0x3f7fc62e, 0x3b92a9f7 + .long 0x3f7fc86e, 0x3b8d5d3c + .long 0x3f7fca99, 0x3b883d46 + .long 0x3f7fccb0, 0x3b8348c6 + .long 0x3f7fceb4, 0x3b7cfce8 + .long 0x3f7fd0a5, 0x3b73ba24 + .long 0x3f7fd283, 0x3b6ac6d3 + .long 0x3f7fd450, 0x3b622096 + .long 0x3f7fd60c, 0x3b59c51d + .long 0x3f7fd7b7, 0x3b51b22a + .long 0x3f7fd953, 0x3b49e589 + .long 0x3f7fdadf, 0x3b425d18 + .long 0x3f7fdc5c, 0x3b3b16c2 + .long 0x3f7fddcc, 0x3b341080 + .long 0x3f7fdf2d, 0x3b2d4858 + .long 0x3f7fe081, 0x3b26bc5e + .long 0x3f7fe1c8, 0x3b206ab2 + .long 0x3f7fe303, 0x3b1a5183 + .long 0x3f7fe431, 0x3b146f09 + .long 0x3f7fe554, 0x3b0ec18c + .long 0x3f7fe66c, 0x3b09475d + .long 0x3f7fe77a, 0x3b03feda + .long 0x3f7fe87d, 0x3afdccdc + .long 0x3f7fe975, 0x3af3f919 + .long 0x3f7fea65, 0x3aea7f6c + .long 0x3f7feb4b, 0x3ae15ce8 + .long 0x3f7fec27, 0x3ad88eb8 + .long 0x3f7fecfc, 0x3ad0121b + .long 0x3f7fedc8, 0x3ac7e464 + .long 0x3f7fee8c, 0x3ac002f8 + .long 0x3f7fef48, 0x3ab86b52 + .long 0x3f7feffd, 0x3ab11afe + .long 0x3f7ff0aa, 
0x3aaa0f9a + .long 0x3f7ff151, 0x3aa346d7 + .long 0x3f7ff1f1, 0x3a9cbe77 + .long 0x3f7ff28a, 0x3a96744c + .long 0x3f7ff31e, 0x3a90663b + .long 0x3f7ff3ab, 0x3a8a9237 + .long 0x3f7ff433, 0x3a84f643 + .long 0x3f7ff4b5, 0x3a7f20e7 + .long 0x3f7ff532, 0x3a74bdd2 + .long 0x3f7ff5aa, 0x3a6abfa9 + .long 0x3f7ff61d, 0x3a6122ea + .long 0x3f7ff68b, 0x3a57e42f + .long 0x3f7ff6f5, 0x3a4f002c + .long 0x3f7ff75a, 0x3a4673af + .long 0x3f7ff7bb, 0x3a3e3ba2 + .long 0x3f7ff819, 0x3a365507 + .long 0x3f7ff872, 0x3a2ebcf6 + .long 0x3f7ff8c7, 0x3a2770a1 + .long 0x3f7ff919, 0x3a206d52 + .long 0x3f7ff968, 0x3a19b066 + .long 0x3f7ff9b3, 0x3a133754 + .long 0x3f7ff9fb, 0x3a0cffa3 + .long 0x3f7ffa40, 0x3a0706f4 + .long 0x3f7ffa82, 0x3a014af8 + .long 0x3f7ffac1, 0x39f792ea + .long 0x3f7ffafe, 0x39ed0088 + .long 0x3f7ffb38, 0x39e2daa1 + .long 0x3f7ffb6f, 0x39d91d2d + .long 0x3f7ffba5, 0x39cfc44a + .long 0x3f7ffbd7, 0x39c6cc35 + .long 0x3f7ffc08, 0x39be314d + .long 0x3f7ffc36, 0x39b5f011 + .long 0x3f7ffc63, 0x39ae051c + .long 0x3f7ffc8e, 0x39a66d2a + .long 0x3f7ffcb6, 0x399f2512 + .long 0x3f7ffcdd, 0x399829c8 + .long 0x3f7ffd02, 0x3991785a + .long 0x3f7ffd26, 0x398b0df2 + .long 0x3f7ffd48, 0x3984e7d2 + .long 0x3f7ffd68, 0x397e06ab + .long 0x3f7ffd87, 0x3972bbde + .long 0x3f7ffda5, 0x3967ea53 + .long 0x3f7ffdc1, 0x395d8d4b + .long 0x3f7ffddc, 0x3953a034 + .long 0x3f7ffdf6, 0x394a1ea5 + .long 0x3f7ffe0f, 0x3941045e + .long 0x3f7ffe27, 0x39384d47 + .long 0x3f7ffe3d, 0x392ff56d + .long 0x3f7ffe53, 0x3927f904 + .long 0x3f7ffe67, 0x39205461 + .long 0x3f7ffe7b, 0x391903fe + .long 0x3f7ffe8d, 0x39120475 + .long 0x3f7ffe9f, 0x390b5281 + .long 0x3f7ffeb0, 0x3904eafc + .long 0x3f7ffec0, 0x38fd95bd + .long 0x3f7ffed0, 0x38f1de7a + .long 0x3f7ffedf, 0x38e6aa94 + .long 0x3f7ffeed, 0x38dbf4a3 + .long 0x3f7ffefa, 0x38d1b776 + .long 0x3f7fff07, 0x38c7ee0e + .long 0x3f7fff13, 0x38be939c + .long 0x3f7fff1f, 0x38b5a381 + .long 0x3f7fff2a, 0x38ad194e + .long 0x3f7fff34, 0x38a4f0bc + .long 0x3f7fff3f, 0x389d25b0 + 
.long 0x3f7fff48, 0x3895b43b + .long 0x3f7fff51, 0x388e9890 + .long 0x3f7fff5a, 0x3887cf0e + .long 0x3f7fff62, 0x38815434 + .long 0x3f7fff6a, 0x3876494d + .long 0x3f7fff72, 0x386a7a5a + .long 0x3f7fff79, 0x385f355e + .long 0x3f7fff80, 0x38547466 + .long 0x3f7fff86, 0x384a31bf + .long 0x3f7fff8c, 0x384067ee + .long 0x3f7fff92, 0x383711b4 + .long 0x3f7fff98, 0x382e2a06 + .long 0x3f7fff9d, 0x3825ac0e + .long 0x3f7fffa2, 0x381d9329 + .long 0x3f7fffa7, 0x3815dae6 + .long 0x3f7fffab, 0x380e7f01 + .long 0x3f7fffb0, 0x38077b62 + .long 0x3f7fffb4, 0x3800cc21 + .long 0x3f7fffb8, 0x37f4daf4 + .long 0x3f7fffbc, 0x37e8b7ac + .long 0x3f7fffbf, 0x37dd2782 + .long 0x3f7fffc2, 0x37d223dc + .long 0x3f7fffc6, 0x37c7a666 + .long 0x3f7fffc9, 0x37bda912 + .long 0x3f7fffcc, 0x37b42611 + .long 0x3f7fffce, 0x37ab17d6 + .long 0x3f7fffd1, 0x37a2790f + .long 0x3f7fffd3, 0x379a44a5 + .long 0x3f7fffd6, 0x379275b9 + .long 0x3f7fffd8, 0x378b07a2 + .long 0x3f7fffda, 0x3783f5e9 + .long 0x3f7fffdc, 0x377a7897 + .long 0x3f7fffde, 0x376dad68 + .long 0x3f7fffe0, 0x37618278 + .long 0x3f7fffe2, 0x3755f04f + .long 0x3f7fffe3, 0x374aefcc + .long 0x3f7fffe5, 0x37407a1d + .long 0x3f7fffe6, 0x373688bc + .long 0x3f7fffe8, 0x372d1570 + .long 0x3f7fffe9, 0x37241a44 + .long 0x3f7fffea, 0x371b9188 + .long 0x3f7fffeb, 0x371375cf + .long 0x3f7fffec, 0x370bc1e7 + .long 0x3f7fffee, 0x370470dd + .long 0x3f7fffef, 0x36fafbec + .long 0x3f7fffef, 0x36edc95b + .long 0x3f7ffff0, 0x36e14167 + .long 0x3f7ffff1, 0x36d55bd6 + .long 0x3f7ffff2, 0x36ca10ce + .long 0x3f7ffff3, 0x36bf58d1 + .long 0x3f7ffff4, 0x36b52cb9 + .long 0x3f7ffff4, 0x36ab85b5 + .long 0x3f7ffff5, 0x36a25d43 + .long 0x3f7ffff5, 0x3699ad31 + .long 0x3f7ffff6, 0x36916f95 + .long 0x3f7ffff7, 0x36899ecb + .long 0x3f7ffff7, 0x36823575 + .long 0x3f7ffff8, 0x36765ce8 + .long 0x3f7ffff8, 0x366909cc + .long 0x3f7ffff9, 0x365c684a + .long 0x3f7ffff9, 0x36506f88 + .long 0x3f7ffff9, 0x36451713 + .long 0x3f7ffffa, 0x363a56e4 + .long 0x3f7ffffa, 0x36302754 + .long 
0x3f7ffffa, 0x36268119 + .long 0x3f7ffffb, 0x361d5d43 + .long 0x3f7ffffb, 0x3614b538 + .long 0x3f7ffffb, 0x360c82b1 + .long 0x3f7ffffc, 0x3604bfb1 + .long 0x3f7ffffc, 0x35facd10 + .long 0x3f7ffffc, 0x35ece39b + .long 0x3f7ffffc, 0x35dfb8b6 + .long 0x3f7ffffd, 0x35d34296 + .long 0x3f7ffffd, 0x35c777ec + .long 0x3f7ffffd, 0x35bc4fdc + .long 0x3f7ffffd, 0x35b1c1fc + .long 0x3f7ffffd, 0x35a7c64b + .long 0x3f7ffffd, 0x359e5531 + .long 0x3f7ffffe, 0x35956771 + .long 0x3f7ffffe, 0x358cf630 + .long 0x3f7ffffe, 0x3584fae8 + .long 0x3f7ffffe, 0x357adecb + .long 0x3f7ffffe, 0x356c9b8f + .long 0x3f7ffffe, 0x355f20ef + .long 0x3f7ffffe, 0x3552644f + .long 0x3f7ffffe, 0x35465b9c + .long 0x3f7fffff, 0x353afd47 + .long 0x3f7fffff, 0x3530403c + .long 0x3f7fffff, 0x35261be0 + .long 0x3f7fffff, 0x351c8807 + .long 0x3f7fffff, 0x35137cf0 + .long 0x3f7fffff, 0x350af341 + .long 0x3f7fffff, 0x3502e402 + .long 0x3f7fffff, 0x34f6912a + .long 0x3f7fffff, 0x34e8356b + .long 0x3f7fffff, 0x34daa8e4 + .long 0x3f7fffff, 0x34cde050 + .long 0x3f7fffff, 0x34c1d100 + .long 0x3f7fffff, 0x34b670d5 + .long 0x3f7fffff, 0x34abb639 + .long 0x3f7fffff, 0x34a19816 + .long 0x3f7fffff, 0x34980dd1 + .long 0x3f7fffff, 0x348f0f43 + .long 0x3f7fffff, 0x348694b3 + .long 0x3f800000, 0x347d2da8 + .long 0x3f800000, 0x346e1d72 + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _AbsMask */ + .align 16 + .long 0x407b8000, 0x407b8000, 0x407b8000, 0x407b8000 /* _MaxThreshold */ + .align 16 + .long 0x47800000, 0x47800000, 0x47800000, 0x47800000 /* _SRound */ + .align 16 + .long 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000 /* _U2THreshold */ + .align 16 + .long 0xbeaaaade, 0xbeaaaade, 0xbeaaaade, 0xbeaaaade /* _poly_3_0 */ + .align 16 + .type __svml_serf_data_internal,@object + .size __svml_serf_data_internal,.-__svml_serf_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core-sse.S new file mode 100644 index 0000000000..4b939f8c55 
--- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized erff, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.  */ + +#define _ZGVdN8v_erff _ZGVdN8v_erff_sse_wrapper +#include "../svml_s_erff8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core.c new file mode 100644 index 0000000000..50f5901db1 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized erff, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.  */ + +#define SYMBOL_NAME _ZGVdN8v_erff +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_erff, __GI__ZGVdN8v_erff, + __redirect__ZGVdN8v_erff) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core_avx2.S new file mode 100644 index 0000000000..d3ffe10d53 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_erff8_core_avx2.S @@ -0,0 +1,666 @@ +/* Function erff vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + *   Basic formula is + *    erf(x) ~ erf(x0) + + *    + exp(-x0*x0)*D*(1+c0+T*P1(T)+D^2*P3(T)+D^4*p5) + *     where D=x-x0, T=x0*D + *     x0 is x rounded to a specified number of fractional bits (in this case 8), + *     except that x0=0 for |x|<3.5/256.0 (using x0=0 for first 4 table entries) + * + *   Data table packs both erf(x0)_high and a few bits of erf(x0)_low in one + *   entry (in place of redundant exponent bits) + * + */ + +/* Offsets for data table __svml_serf_data_internal + */ +#define _erf_tbl 0 +#define _AbsMask 4032 +#define _MaxThreshold 4064 +#define _SRound 4096 +#define _U2Threshold 4128 +#define _poly3_0 4160 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_erff_avx2) + lea -1006632960+__svml_serf_data_internal(%rip), %rax + +/* + * vector gather: + * erf(x0), exp(-x0*x0)*2.0/sqrt(pi) + */ + vmovups _SRound+__svml_serf_data_internal(%rip), %ymm7 + vandps _AbsMask+__svml_serf_data_internal(%rip), %ymm0, %ymm6 + +/* + * erf(x) rounds to 1.0 for x>_MaxThreshold (3.9375) + * can compute all results in the main path + */ + vminps _MaxThreshold+__svml_serf_data_internal(%rip), %ymm6, %ymm8 + vaddps %ymm7, %ymm8, %ymm10 + vcmpgt_oqps _U2Threshold+__svml_serf_data_internal(%rip), %ymm8, %ymm9 + vpslld $3, %ymm10, %ymm11 + vsubps %ymm7, %ymm10, %ymm4 + vsubps %ymm4, %ymm8, %ymm3 + vandps %ymm9, %ymm3, %ymm2 + +/* NaN fixup */ + vminps %ymm6, %ymm3, %ymm3 + +/* D2 = Diff^2 */ + vmulps %ymm2, %ymm2, %ymm2 + +/* save sign */ + vxorps %ymm0, %ymm6, %ymm5 + vmovd %xmm11, %edx + vextractf128 $1, %ymm11, %xmm12 + vpextrd $2, %xmm11, %esi + movslq %edx, %rdx + movslq %esi, %rsi + vmovd %xmm12, %r8d + vmovq (%rax,%rdx), %xmm13 + vmovq (%rax,%rsi), %xmm14 + vunpcklps %xmm14, %xmm13, %xmm10 + vmovups _poly3_0+__svml_serf_data_internal(%rip), %ymm14 + vpextrd $1, %xmm11, %ecx + vpextrd $3, %xmm11, %edi + vpextrd $1, %xmm12, %r9d + vpextrd $2, %xmm12, %r10d + vpextrd $3, %xmm12, %r11d + +/* + * Start polynomial 
evaluation + * P1 + */ + vfmsub231ps %ymm14, %ymm3, %ymm4 + movslq %ecx, %rcx + movslq %edi, %rdi + movslq %r8d, %r8 + movslq %r9d, %r9 + movslq %r10d, %r10 + movslq %r11d, %r11 + vmovq (%rax,%rcx), %xmm1 + vmovq (%rax,%rdi), %xmm15 + +/* + * branch-free + * (exp_h(x0) * Diff) * (poly + 1.0) + */ + vfmadd213ps %ymm3, %ymm2, %ymm4 + vmovq (%rax,%r8), %xmm7 + vmovq (%rax,%r9), %xmm0 + vmovq (%rax,%r10), %xmm8 + vmovq (%rax,%r11), %xmm9 + vunpcklps %xmm15, %xmm1, %xmm11 + vunpcklps %xmm8, %xmm7, %xmm1 + vunpcklps %xmm9, %xmm0, %xmm0 + vinsertf128 $1, %xmm1, %ymm10, %ymm12 + vinsertf128 $1, %xmm0, %ymm11, %ymm13 + vunpcklps %ymm13, %ymm12, %ymm0 + vunpckhps %ymm13, %ymm12, %ymm15 + +/* Final result */ + vfmadd213ps %ymm0, %ymm15, %ymm4 + +/* set sign */ + vorps %ymm5, %ymm4, %ymm0 + ret + +END(_ZGVdN8v_erff_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_serf_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _erf_tbl[1008][1]; + __declspec(align(32)) VUINT32 _AbsMask[8][1]; + __declspec(align(32)) VUINT32 _MaxThreshold[8][1]; + __declspec(align(32)) VUINT32 _SRound[8][1]; + __declspec(align(32)) VUINT32 _U2Threshold[8][1]; + __declspec(align(32)) VUINT32 _poly3_0[8][1]; +} __svml_serf_data_internal; +#endif +__svml_serf_data_internal: + /*== _erf_tbl ==*/ + .long 0x00000000, 0x3f906ebb + .long 0x3c106dfa, 0x3f906c79 + .long 0x3c906bb8, 0x3f9065b4 + .long 0x3cd89bf0, 0x3f905a6c + .long 0x3d1062b2, 0x3f904aa3 + .long 0x3d3472ea, 0x3f90365a + .long 0x3d587d7f, 0x3f901d93 + .long 0x3d7c8154, 0x3f900050 + .long 0x3d903ea4, 0x3f8fde94 + .long 0x3da2381f, 0x3f8fb862 + .long 0x3db42c8d, 0x3f8f8dbd + .long 0x3dc61b5f, 0x3f8f5eab + .long 0x3dd80409, 0x3f8f2b2e + .long 0x3de9e5fc, 0x3f8ef34c + .long 0x3dfbc0ad, 0x3f8eb70a + .long 0x3e06c9c8, 0x3f8e766e + .long 0x3e0faf0d, 0x3f8e317d + .long 0x3e188fe1, 0x3f8de83e + .long 0x3e216bfe, 0x3f8d9ab9 + .long 0x3e2a4321, 0x3f8d48f3 + .long 0x3e331506, 0x3f8cf2f5 + .long 
0x3e3be169, 0x3f8c98c6 + .long 0x3e44a808, 0x3f8c3a6f + .long 0x3e4d68a1, 0x3f8bd7f8 + .long 0x3e5622f2, 0x3f8b716c + .long 0x3e5ed6b9, 0x3f8b06d2 + .long 0x3e6783b7, 0x3f8a9834 + .long 0x3e7029aa, 0x3f8a259e + .long 0x3e78c855, 0x3f89af18 + .long 0x3e80afbc, 0x3f8934af + .long 0x3e84f76b, 0x3f88b66c + .long 0x3e893b19, 0x3f88345d + .long 0x3e8d7aa7, 0x3f87ae8b + .long 0x3e91b5f8, 0x3f872504 + .long 0x3e95ecee, 0x3f8697d3 + .long 0x3e9a1f6b, 0x3f860705 + .long 0x3e9e4d54, 0x3f8572a8 + .long 0x3ea2768c, 0x3f84dac8 + .long 0x3ea69af8, 0x3f843f72 + .long 0x3eaaba7a, 0x3f83a0b6 + .long 0x3eaed4fa, 0x3f82fe9f + .long 0x3eb2ea5c, 0x3f82593e + .long 0x3eb6fa85, 0x3f81b0a0 + .long 0x3ebb055d, 0x3f8104d3 + .long 0x3ebf0aca, 0x3f8055e8 + .long 0x3ec30ab3, 0x3f7f47d8 + .long 0x3ec70501, 0x3f7ddddf + .long 0x3ecaf99b, 0x3f7c6e05 + .long 0x3ecee869, 0x3f7af867 + .long 0x3ed2d156, 0x3f797d26 + .long 0x3ed6b44b, 0x3f77fc62 + .long 0x3eda9132, 0x3f76763c + .long 0x3ede67f6, 0x3f74ead4 + .long 0x3ee23882, 0x3f735a4c + .long 0x3ee602c2, 0x3f71c4c4 + .long 0x3ee9c6a2, 0x3f702a5f + .long 0x3eed840e, 0x3f6e8b3e + .long 0x3ef13af5, 0x3f6ce783 + .long 0x3ef4eb45, 0x3f6b3f51 + .long 0x3ef894ea, 0x3f6992c9 + .long 0x3efc37d5, 0x3f67e20f + .long 0x3effd3f5, 0x3f662d45 + .long 0x3f01b49d, 0x3f64748e + .long 0x3f037bca, 0x3f62b80d + .long 0x3f053f7b, 0x3f60f7e5 + .long 0x3f06ffa8, 0x3f5f3439 + .long 0x3f08bc4a, 0x3f5d6d2d + .long 0x3f0a755a, 0x3f5ba2e3 + .long 0x3f0c2ad3, 0x3f59d57e + .long 0x3f0ddcae, 0x3f580523 + .long 0x3f0f8ae6, 0x3f5631f4 + .long 0x3f113574, 0x3f545c14 + .long 0x3f12dc54, 0x3f5283a7 + .long 0x3f147f81, 0x3f50a8cf + .long 0x3f161ef6, 0x3f4ecbb1 + .long 0x3f17baae, 0x3f4cec6d + .long 0x3f1952a6, 0x3f4b0b28 + .long 0x3f1ae6da, 0x3f492804 + .long 0x3f1c7745, 0x3f474323 + .long 0x3f1e03e5, 0x3f455ca8 + .long 0x3f1f8cb7, 0x3f4374b5 + .long 0x3f2111b7, 0x3f418b6b + .long 0x3f2292e4, 0x3f3fa0ee + .long 0x3f24103a, 0x3f3db55e + .long 0x3f2589b9, 0x3f3bc8dc + .long 0x3f26ff5d, 
0x3f39db8a + .long 0x3f287126, 0x3f37ed89 + .long 0x3f29df13, 0x3f35fef8 + .long 0x3f2b4922, 0x3f340ff9 + .long 0x3f2caf53, 0x3f3220ab + .long 0x3f2e11a4, 0x3f30312e + .long 0x3f2f7017, 0x3f2e41a1 + .long 0x3f30caab, 0x3f2c5223 + .long 0x3f322160, 0x3f2a62d3 + .long 0x3f337437, 0x3f2873cf + .long 0x3f34c32f, 0x3f268534 + .long 0x3f360e4c, 0x3f249721 + .long 0x3f37558c, 0x3f22a9b3 + .long 0x3f3898f3, 0x3f20bd06 + .long 0x3f39d881, 0x3f1ed137 + .long 0x3f3b1438, 0x3f1ce661 + .long 0x3f3c4c1b, 0x3f1afca0 + .long 0x3f3d802c, 0x3f19140f + .long 0x3f3eb06c, 0x3f172cc9 + .long 0x3f3fdce0, 0x3f1546e7 + .long 0x3f410589, 0x3f136284 + .long 0x3f422a6b, 0x3f117fb9 + .long 0x3f434b89, 0x3f0f9e9e + .long 0x3f4468e7, 0x3f0dbf4c + .long 0x3f458287, 0x3f0be1db + .long 0x3f46986f, 0x3f0a0662 + .long 0x3f47aaa2, 0x3f082cf7 + .long 0x3f48b925, 0x3f0655b1 + .long 0x3f49c3fb, 0x3f0480a6 + .long 0x3f4acb29, 0x3f02adeb + .long 0x3f4bceb4, 0x3f00dd96 + .long 0x3f4ccea1, 0x3efe1f73 + .long 0x3f4dcaf4, 0x3efa88d5 + .long 0x3f4ec3b4, 0x3ef6f777 + .long 0x3f4fb8e5, 0x3ef36b80 + .long 0x3f50aa8d, 0x3eefe513 + .long 0x3f5198b1, 0x3eec6455 + .long 0x3f528358, 0x3ee8e968 + .long 0x3f536a86, 0x3ee5746d + .long 0x3f544e43, 0x3ee20584 + .long 0x3f552e93, 0x3ede9ccc + .long 0x3f560b7e, 0x3edb3a64 + .long 0x3f56e50a, 0x3ed7de6a + .long 0x3f57bb3d, 0x3ed488f8 + .long 0x3f588e1e, 0x3ed13a2b + .long 0x3f595db4, 0x3ecdf21c + .long 0x3f5a2a05, 0x3ecab0e4 + .long 0x3f5af318, 0x3ec7769b + .long 0x3f5bb8f4, 0x3ec44359 + .long 0x3f5c7ba1, 0x3ec11733 + .long 0x3f5d3b25, 0x3ebdf23d + .long 0x3f5df788, 0x3ebad48d + .long 0x3f5eb0d1, 0x3eb7be35 + .long 0x3f5f6707, 0x3eb4af46 + .long 0x3f601a32, 0x3eb1a7d3 + .long 0x3f60ca59, 0x3eaea7ea + .long 0x3f617784, 0x3eabaf9a + .long 0x3f6221bb, 0x3ea8bef3 + .long 0x3f62c905, 0x3ea5d600 + .long 0x3f636d69, 0x3ea2f4ce + .long 0x3f640ef1, 0x3ea01b68 + .long 0x3f64ada3, 0x3e9d49d9 + .long 0x3f654987, 0x3e9a8029 + .long 0x3f65e2a6, 0x3e97be62 + .long 0x3f667906, 0x3e95048b + 
.long 0x3f670cb1, 0x3e9252aa + .long 0x3f679dae, 0x3e8fa8c5 + .long 0x3f682c06, 0x3e8d06e3 + .long 0x3f68b7bf, 0x3e8a6d05 + .long 0x3f6940e2, 0x3e87db31 + .long 0x3f69c778, 0x3e855168 + .long 0x3f6a4b88, 0x3e82cfad + .long 0x3f6acd1a, 0x3e805600 + .long 0x3f6b4c36, 0x3e7bc8c2 + .long 0x3f6bc8e5, 0x3e76f5a0 + .long 0x3f6c432f, 0x3e723298 + .long 0x3f6cbb1b, 0x3e6d7fa5 + .long 0x3f6d30b1, 0x3e68dcc1 + .long 0x3f6da3fa, 0x3e6449e7 + .long 0x3f6e14fe, 0x3e5fc70e + .long 0x3f6e83c4, 0x3e5b542b + .long 0x3f6ef055, 0x3e56f136 + .long 0x3f6f5ab8, 0x3e529e21 + .long 0x3f6fc2f5, 0x3e4e5adf + .long 0x3f702915, 0x3e4a2761 + .long 0x3f708d1f, 0x3e460399 + .long 0x3f70ef1b, 0x3e41ef75 + .long 0x3f714f11, 0x3e3deae4 + .long 0x3f71ad09, 0x3e39f5d2 + .long 0x3f72090a, 0x3e36102b + .long 0x3f72631c, 0x3e3239db + .long 0x3f72bb46, 0x3e2e72cb + .long 0x3f731191, 0x3e2abae4 + .long 0x3f736604, 0x3e27120f + .long 0x3f73b8a5, 0x3e237833 + .long 0x3f74097e, 0x3e1fed36 + .long 0x3f745895, 0x3e1c70fd + .long 0x3f74a5f2, 0x3e19036e + .long 0x3f74f19b, 0x3e15a46d + .long 0x3f753b98, 0x3e1253dc + .long 0x3f7583f1, 0x3e0f119f + .long 0x3f75caac, 0x3e0bdd96 + .long 0x3f760fd1, 0x3e08b7a4 + .long 0x3f765366, 0x3e059fa9 + .long 0x3f769573, 0x3e029586 + .long 0x3f76d5fe, 0x3dff3230 + .long 0x3f77150f, 0x3df95481 + .long 0x3f7752ab, 0x3df391b9 + .long 0x3f778eda, 0x3dede995 + .long 0x3f77c9a2, 0x3de85bd0 + .long 0x3f78030a, 0x3de2e825 + .long 0x3f783b18, 0x3ddd8e4c + .long 0x3f7871d3, 0x3dd84dfe + .long 0x3f78a741, 0x3dd326f3 + .long 0x3f78db68, 0x3dce18e3 + .long 0x3f790e50, 0x3dc92385 + .long 0x3f793ffc, 0x3dc4468f + .long 0x3f797075, 0x3dbf81b6 + .long 0x3f799fbf, 0x3dbad4b0 + .long 0x3f79cde1, 0x3db63f32 + .long 0x3f79fae1, 0x3db1c0f1 + .long 0x3f7a26c4, 0x3dad59a1 + .long 0x3f7a518f, 0x3da908f6 + .long 0x3f7a7b4a, 0x3da4cea4 + .long 0x3f7aa3f9, 0x3da0aa5e + .long 0x3f7acba1, 0x3d9c9bd9 + .long 0x3f7af248, 0x3d98a2c7 + .long 0x3f7b17f4, 0x3d94bedd + .long 0x3f7b3ca9, 0x3d90efcd + .long 
0x3f7b606e, 0x3d8d354b + .long 0x3f7b8346, 0x3d898f0a + .long 0x3f7ba537, 0x3d85fcbf + .long 0x3f7bc646, 0x3d827e1d + .long 0x3f7be677, 0x3d7e25af + .long 0x3f7c05d1, 0x3d777546 + .long 0x3f7c2456, 0x3d70ea68 + .long 0x3f7c420d, 0x3d6a847d + .long 0x3f7c5ef9, 0x3d6442f0 + .long 0x3f7c7b1f, 0x3d5e252a + .long 0x3f7c9684, 0x3d582a98 + .long 0x3f7cb12b, 0x3d5252a5 + .long 0x3f7ccb1a, 0x3d4c9cbd + .long 0x3f7ce454, 0x3d47084e + .long 0x3f7cfcdd, 0x3d4194c7 + .long 0x3f7d14ba, 0x3d3c4196 + .long 0x3f7d2bef, 0x3d370e2c + .long 0x3f7d427f, 0x3d31f9fb + .long 0x3f7d586f, 0x3d2d0474 + .long 0x3f7d6dc2, 0x3d282d0c + .long 0x3f7d827b, 0x3d237336 + .long 0x3f7d96a0, 0x3d1ed669 + .long 0x3f7daa32, 0x3d1a561b + .long 0x3f7dbd36, 0x3d15f1c6 + .long 0x3f7dcfb0, 0x3d11a8e1 + .long 0x3f7de1a2, 0x3d0d7ae9 + .long 0x3f7df30f, 0x3d09675a + .long 0x3f7e03fd, 0x3d056db0 + .long 0x3f7e146c, 0x3d018d6b + .long 0x3f7e2461, 0x3cfb8c15 + .long 0x3f7e33de, 0x3cf42e22 + .long 0x3f7e42e8, 0x3ced0003 + .long 0x3f7e517f, 0x3ce600c0 + .long 0x3f7e5fa9, 0x3cdf2f67 + .long 0x3f7e6d66, 0x3cd88b05 + .long 0x3f7e7abb, 0x3cd212ad + .long 0x3f7e87aa, 0x3ccbc574 + .long 0x3f7e9435, 0x3cc5a273 + .long 0x3f7ea05f, 0x3cbfa8c4 + .long 0x3f7eac2b, 0x3cb9d786 + .long 0x3f7eb79a, 0x3cb42ddb + .long 0x3f7ec2b1, 0x3caeaae6 + .long 0x3f7ecd71, 0x3ca94dcf + .long 0x3f7ed7dc, 0x3ca415c2 + .long 0x3f7ee1f4, 0x3c9f01ec + .long 0x3f7eebbd, 0x3c9a117f + .long 0x3f7ef537, 0x3c9543ae + .long 0x3f7efe66, 0x3c9097b1 + .long 0x3f7f074b, 0x3c8c0cc2 + .long 0x3f7f0fe8, 0x3c87a21f + .long 0x3f7f1840, 0x3c83570a + .long 0x3f7f2053, 0x3c7e558a + .long 0x3f7f2826, 0x3c763931 + .long 0x3f7f2fb8, 0x3c6e579b + .long 0x3f7f370c, 0x3c66af65 + .long 0x3f7f3e23, 0x3c5f3f2d + .long 0x3f7f4500, 0x3c58059c + .long 0x3f7f4ba4, 0x3c51015f + .long 0x3f7f5211, 0x3c4a3127 + .long 0x3f7f5848, 0x3c4393af + .long 0x3f7f5e4b, 0x3c3d27b5 + .long 0x3f7f641b, 0x3c36ebff + .long 0x3f7f69ba, 0x3c30df57 + .long 0x3f7f6f29, 0x3c2b008e + .long 0x3f7f746a, 
0x3c254e7b + .long 0x3f7f797f, 0x3c1fc7fb + .long 0x3f7f7e67, 0x3c1a6bee + .long 0x3f7f8326, 0x3c15393d + .long 0x3f7f87bb, 0x3c102ed6 + .long 0x3f7f8c29, 0x3c0b4bab + .long 0x3f7f9070, 0x3c068eb5 + .long 0x3f7f9492, 0x3c01f6f1 + .long 0x3f7f9890, 0x3bfb06c5 + .long 0x3f7f9c6b, 0x3bf26625 + .long 0x3f7fa024, 0x3bea0a1d + .long 0x3f7fa3bc, 0x3be1f0d3 + .long 0x3f7fa734, 0x3bda1876 + .long 0x3f7faa8d, 0x3bd27f42 + .long 0x3f7fadc8, 0x3bcb237a + .long 0x3f7fb0e6, 0x3bc4036c + .long 0x3f7fb3e8, 0x3bbd1d6f + .long 0x3f7fb6cf, 0x3bb66fe6 + .long 0x3f7fb99c, 0x3baff93b + .long 0x3f7fbc4f, 0x3ba9b7e1 + .long 0x3f7fbeea, 0x3ba3aa56 + .long 0x3f7fc16d, 0x3b9dcf20 + .long 0x3f7fc3d9, 0x3b9824ce + .long 0x3f7fc62e, 0x3b92a9f7 + .long 0x3f7fc86e, 0x3b8d5d3c + .long 0x3f7fca99, 0x3b883d46 + .long 0x3f7fccb0, 0x3b8348c6 + .long 0x3f7fceb4, 0x3b7cfce8 + .long 0x3f7fd0a5, 0x3b73ba24 + .long 0x3f7fd283, 0x3b6ac6d3 + .long 0x3f7fd450, 0x3b622096 + .long 0x3f7fd60c, 0x3b59c51d + .long 0x3f7fd7b7, 0x3b51b22a + .long 0x3f7fd953, 0x3b49e589 + .long 0x3f7fdadf, 0x3b425d18 + .long 0x3f7fdc5c, 0x3b3b16c2 + .long 0x3f7fddcc, 0x3b341080 + .long 0x3f7fdf2d, 0x3b2d4858 + .long 0x3f7fe081, 0x3b26bc5e + .long 0x3f7fe1c8, 0x3b206ab2 + .long 0x3f7fe303, 0x3b1a5183 + .long 0x3f7fe431, 0x3b146f09 + .long 0x3f7fe554, 0x3b0ec18c + .long 0x3f7fe66c, 0x3b09475d + .long 0x3f7fe77a, 0x3b03feda + .long 0x3f7fe87d, 0x3afdccdc + .long 0x3f7fe975, 0x3af3f919 + .long 0x3f7fea65, 0x3aea7f6c + .long 0x3f7feb4b, 0x3ae15ce8 + .long 0x3f7fec27, 0x3ad88eb8 + .long 0x3f7fecfc, 0x3ad0121b + .long 0x3f7fedc8, 0x3ac7e464 + .long 0x3f7fee8c, 0x3ac002f8 + .long 0x3f7fef48, 0x3ab86b52 + .long 0x3f7feffd, 0x3ab11afe + .long 0x3f7ff0aa, 0x3aaa0f9a + .long 0x3f7ff151, 0x3aa346d7 + .long 0x3f7ff1f1, 0x3a9cbe77 + .long 0x3f7ff28a, 0x3a96744c + .long 0x3f7ff31e, 0x3a90663b + .long 0x3f7ff3ab, 0x3a8a9237 + .long 0x3f7ff433, 0x3a84f643 + .long 0x3f7ff4b5, 0x3a7f20e7 + .long 0x3f7ff532, 0x3a74bdd2 + .long 0x3f7ff5aa, 0x3a6abfa9 + 
.long 0x3f7ff61d, 0x3a6122ea + .long 0x3f7ff68b, 0x3a57e42f + .long 0x3f7ff6f5, 0x3a4f002c + .long 0x3f7ff75a, 0x3a4673af + .long 0x3f7ff7bb, 0x3a3e3ba2 + .long 0x3f7ff819, 0x3a365507 + .long 0x3f7ff872, 0x3a2ebcf6 + .long 0x3f7ff8c7, 0x3a2770a1 + .long 0x3f7ff919, 0x3a206d52 + .long 0x3f7ff968, 0x3a19b066 + .long 0x3f7ff9b3, 0x3a133754 + .long 0x3f7ff9fb, 0x3a0cffa3 + .long 0x3f7ffa40, 0x3a0706f4 + .long 0x3f7ffa82, 0x3a014af8 + .long 0x3f7ffac1, 0x39f792ea + .long 0x3f7ffafe, 0x39ed0088 + .long 0x3f7ffb38, 0x39e2daa1 + .long 0x3f7ffb6f, 0x39d91d2d + .long 0x3f7ffba5, 0x39cfc44a + .long 0x3f7ffbd7, 0x39c6cc35 + .long 0x3f7ffc08, 0x39be314d + .long 0x3f7ffc36, 0x39b5f011 + .long 0x3f7ffc63, 0x39ae051c + .long 0x3f7ffc8e, 0x39a66d2a + .long 0x3f7ffcb6, 0x399f2512 + .long 0x3f7ffcdd, 0x399829c8 + .long 0x3f7ffd02, 0x3991785a + .long 0x3f7ffd26, 0x398b0df2 + .long 0x3f7ffd48, 0x3984e7d2 + .long 0x3f7ffd68, 0x397e06ab + .long 0x3f7ffd87, 0x3972bbde + .long 0x3f7ffda5, 0x3967ea53 + .long 0x3f7ffdc1, 0x395d8d4b + .long 0x3f7ffddc, 0x3953a034 + .long 0x3f7ffdf6, 0x394a1ea5 + .long 0x3f7ffe0f, 0x3941045e + .long 0x3f7ffe27, 0x39384d47 + .long 0x3f7ffe3d, 0x392ff56d + .long 0x3f7ffe53, 0x3927f904 + .long 0x3f7ffe67, 0x39205461 + .long 0x3f7ffe7b, 0x391903fe + .long 0x3f7ffe8d, 0x39120475 + .long 0x3f7ffe9f, 0x390b5281 + .long 0x3f7ffeb0, 0x3904eafc + .long 0x3f7ffec0, 0x38fd95bd + .long 0x3f7ffed0, 0x38f1de7a + .long 0x3f7ffedf, 0x38e6aa94 + .long 0x3f7ffeed, 0x38dbf4a3 + .long 0x3f7ffefa, 0x38d1b776 + .long 0x3f7fff07, 0x38c7ee0e + .long 0x3f7fff13, 0x38be939c + .long 0x3f7fff1f, 0x38b5a381 + .long 0x3f7fff2a, 0x38ad194e + .long 0x3f7fff34, 0x38a4f0bc + .long 0x3f7fff3f, 0x389d25b0 + .long 0x3f7fff48, 0x3895b43b + .long 0x3f7fff51, 0x388e9890 + .long 0x3f7fff5a, 0x3887cf0e + .long 0x3f7fff62, 0x38815434 + .long 0x3f7fff6a, 0x3876494d + .long 0x3f7fff72, 0x386a7a5a + .long 0x3f7fff79, 0x385f355e + .long 0x3f7fff80, 0x38547466 + .long 0x3f7fff86, 0x384a31bf + .long 
0x3f7fff8c, 0x384067ee + .long 0x3f7fff92, 0x383711b4 + .long 0x3f7fff98, 0x382e2a06 + .long 0x3f7fff9d, 0x3825ac0e + .long 0x3f7fffa2, 0x381d9329 + .long 0x3f7fffa7, 0x3815dae6 + .long 0x3f7fffab, 0x380e7f01 + .long 0x3f7fffb0, 0x38077b62 + .long 0x3f7fffb4, 0x3800cc21 + .long 0x3f7fffb8, 0x37f4daf4 + .long 0x3f7fffbc, 0x37e8b7ac + .long 0x3f7fffbf, 0x37dd2782 + .long 0x3f7fffc2, 0x37d223dc + .long 0x3f7fffc6, 0x37c7a666 + .long 0x3f7fffc9, 0x37bda912 + .long 0x3f7fffcc, 0x37b42611 + .long 0x3f7fffce, 0x37ab17d6 + .long 0x3f7fffd1, 0x37a2790f + .long 0x3f7fffd3, 0x379a44a5 + .long 0x3f7fffd6, 0x379275b9 + .long 0x3f7fffd8, 0x378b07a2 + .long 0x3f7fffda, 0x3783f5e9 + .long 0x3f7fffdc, 0x377a7897 + .long 0x3f7fffde, 0x376dad68 + .long 0x3f7fffe0, 0x37618278 + .long 0x3f7fffe2, 0x3755f04f + .long 0x3f7fffe3, 0x374aefcc + .long 0x3f7fffe5, 0x37407a1d + .long 0x3f7fffe6, 0x373688bc + .long 0x3f7fffe8, 0x372d1570 + .long 0x3f7fffe9, 0x37241a44 + .long 0x3f7fffea, 0x371b9188 + .long 0x3f7fffeb, 0x371375cf + .long 0x3f7fffec, 0x370bc1e7 + .long 0x3f7fffee, 0x370470dd + .long 0x3f7fffef, 0x36fafbec + .long 0x3f7fffef, 0x36edc95b + .long 0x3f7ffff0, 0x36e14167 + .long 0x3f7ffff1, 0x36d55bd6 + .long 0x3f7ffff2, 0x36ca10ce + .long 0x3f7ffff3, 0x36bf58d1 + .long 0x3f7ffff4, 0x36b52cb9 + .long 0x3f7ffff4, 0x36ab85b5 + .long 0x3f7ffff5, 0x36a25d43 + .long 0x3f7ffff5, 0x3699ad31 + .long 0x3f7ffff6, 0x36916f95 + .long 0x3f7ffff7, 0x36899ecb + .long 0x3f7ffff7, 0x36823575 + .long 0x3f7ffff8, 0x36765ce8 + .long 0x3f7ffff8, 0x366909cc + .long 0x3f7ffff9, 0x365c684a + .long 0x3f7ffff9, 0x36506f88 + .long 0x3f7ffff9, 0x36451713 + .long 0x3f7ffffa, 0x363a56e4 + .long 0x3f7ffffa, 0x36302754 + .long 0x3f7ffffa, 0x36268119 + .long 0x3f7ffffb, 0x361d5d43 + .long 0x3f7ffffb, 0x3614b538 + .long 0x3f7ffffb, 0x360c82b1 + .long 0x3f7ffffc, 0x3604bfb1 + .long 0x3f7ffffc, 0x35facd10 + .long 0x3f7ffffc, 0x35ece39b + .long 0x3f7ffffc, 0x35dfb8b6 + .long 0x3f7ffffd, 0x35d34296 + .long 0x3f7ffffd, 
0x35c777ec + .long 0x3f7ffffd, 0x35bc4fdc + .long 0x3f7ffffd, 0x35b1c1fc + .long 0x3f7ffffd, 0x35a7c64b + .long 0x3f7ffffd, 0x359e5531 + .long 0x3f7ffffe, 0x35956771 + .long 0x3f7ffffe, 0x358cf630 + .long 0x3f7ffffe, 0x3584fae8 + .long 0x3f7ffffe, 0x357adecb + .long 0x3f7ffffe, 0x356c9b8f + .long 0x3f7ffffe, 0x355f20ef + .long 0x3f7ffffe, 0x3552644f + .long 0x3f7ffffe, 0x35465b9c + .long 0x3f7fffff, 0x353afd47 + .long 0x3f7fffff, 0x3530403c + .long 0x3f7fffff, 0x35261be0 + .long 0x3f7fffff, 0x351c8807 + .long 0x3f7fffff, 0x35137cf0 + .long 0x3f7fffff, 0x350af341 + .long 0x3f7fffff, 0x3502e402 + .long 0x3f7fffff, 0x34f6912a + .long 0x3f7fffff, 0x34e8356b + .long 0x3f7fffff, 0x34daa8e4 + .long 0x3f7fffff, 0x34cde050 + .long 0x3f7fffff, 0x34c1d100 + .long 0x3f7fffff, 0x34b670d5 + .long 0x3f7fffff, 0x34abb639 + .long 0x3f7fffff, 0x34a19816 + .long 0x3f7fffff, 0x34980dd1 + .long 0x3f7fffff, 0x348f0f43 + .long 0x3f7fffff, 0x348694b3 + .long 0x3f800000, 0x347d2da8 + .long 0x3f800000, 0x346e1d72 + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _AbsMask */ + .align 32 + .long 0x407b8000, 0x407b8000, 0x407b8000, 0x407b8000, 0x407b8000, 0x407b8000, 0x407b8000, 0x407b8000 /* _MaxThreshold */ + .align 32 + .long 0x47800000, 0x47800000, 0x47800000, 0x47800000, 0x47800000, 0x47800000, 0x47800000, 0x47800000 /* _SRound */ + .align 32 + .long 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000, 0x2f800000 /* _U2THreshold */ + .align 32 + .long 0xbeaaaade, 0xbeaaaade, 0xbeaaaade, 0xbeaaaade, 0xbeaaaade, 0xbeaaaade, 0xbeaaaade, 0xbeaaaade /* _poly_3_0 */ + .align 32 + .type __svml_serf_data_internal,@object + .size __svml_serf_data_internal,.-__svml_serf_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_erf2_core.S b/sysdeps/x86_64/fpu/svml_d_erf2_core.S new file mode 100644 index 0000000000..6ef30af2bd --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_erf2_core.S @@ -0,0 +1,29 @@ +/* 
Function erf vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.  */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_erf) +WRAPPER_IMPL_SSE2 erf +END (_ZGVbN2v_erf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_erf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_erf4_core.S b/sysdeps/x86_64/fpu/svml_d_erf4_core.S new file mode 100644 index 0000000000..2ca8dfe92e --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_erf4_core.S @@ -0,0 +1,29 @@ +/* Function erf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_erf) +WRAPPER_IMPL_AVX _ZGVbN2v_erf +END (_ZGVdN4v_erf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_erf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_erf4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_erf4_core_avx.S new file mode 100644 index 0000000000..264ff09459 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_erf4_core_avx.S @@ -0,0 +1,25 @@ +/* Function erf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.  */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_erf) +WRAPPER_IMPL_AVX _ZGVbN2v_erf +END (_ZGVcN4v_erf) diff --git a/sysdeps/x86_64/fpu/svml_d_erf8_core.S b/sysdeps/x86_64/fpu/svml_d_erf8_core.S new file mode 100644 index 0000000000..de8c2a48bb --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_erf8_core.S @@ -0,0 +1,25 @@ +/* Function erf vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_erf) +WRAPPER_IMPL_AVX512 _ZGVdN4v_erf +END (_ZGVeN8v_erf) diff --git a/sysdeps/x86_64/fpu/svml_s_erff16_core.S b/sysdeps/x86_64/fpu/svml_s_erff16_core.S new file mode 100644 index 0000000000..2c5037a0ec --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_erff16_core.S @@ -0,0 +1,25 @@ +/* Function erff vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_erff) +WRAPPER_IMPL_AVX512 _ZGVdN8v_erff +END (_ZGVeN16v_erff) diff --git a/sysdeps/x86_64/fpu/svml_s_erff4_core.S b/sysdeps/x86_64/fpu/svml_s_erff4_core.S new file mode 100644 index 0000000000..0f58bb7aaf --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_erff4_core.S @@ -0,0 +1,29 @@ +/* Function erff vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_erff) +WRAPPER_IMPL_SSE2 erff +END (_ZGVbN4v_erff) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_erff) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_erff8_core.S b/sysdeps/x86_64/fpu/svml_s_erff8_core.S new file mode 100644 index 0000000000..a9f287c420 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_erff8_core.S @@ -0,0 +1,29 @@ +/* Function erff vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_erff) +WRAPPER_IMPL_AVX _ZGVbN4v_erff +END (_ZGVdN8v_erff) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_erff) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_erff8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_erff8_core_avx.S new file mode 100644 index 0000000000..ca5a8048e8 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_erff8_core_avx.S @@ -0,0 +1,25 @@ +/* Function erff vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_erff) +WRAPPER_IMPL_AVX _ZGVbN4v_erff +END (_ZGVcN8v_erff) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx.c new file mode 100644 index 0000000000..a2eceefc9b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-erf.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx2.c new file mode 100644 index 0000000000..a2eceefc9b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-erf.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx512f.c new file mode 100644 index 0000000000..a2eceefc9b --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-erf-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-erf.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-erf.c b/sysdeps/x86_64/fpu/test-double-libmvec-erf.c new file mode 100644 index 0000000000..c1ded24b1d --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-erf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC erf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index db7ae3e7a6..9d91ccfe51 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVbN2v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVbN2v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVbN2v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVbN2v_acosh) +VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVbN2v_erf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c 
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 269ae38f67..9e86d5fef8 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -46,6 +46,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVdN4v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVdN4v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVdN4v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVdN4v_acosh) +VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVdN4v_erf) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index d95b960a45..0f4ef00de4 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVcN4v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVcN4v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVcN4v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVcN4v_acosh) +VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVcN4v_erf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index a22f08b5f8..975dff85af 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2), _ZGVeN8v_log2) VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVeN8v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVeN8v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVeN8v_acosh) +VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVeN8v_erf) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx.c new file mode 100644 index 0000000000..8cdf4dc069 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-erff.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx2.c new file mode 100644 index 0000000000..8cdf4dc069 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-erff.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx512f.c new file mode 100644 index 0000000000..8cdf4dc069 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-erff-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-erff.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-erff.c b/sysdeps/x86_64/fpu/test-float-libmvec-erff.c new file mode 100644 index 0000000000..ba83826ab9 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-erff.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC erff +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 7982ae2c84..2b1e27391a 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVeN16v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVeN16v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVeN16v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVeN16v_acoshf) +VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVeN16v_erff) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index bdfcbea2cd..78428bf517 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVbN4v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVbN4v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVbN4v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVbN4v_acoshf) +VECTOR_WRAPPER 
(WRAPPER_NAME (erff), _ZGVbN4v_erff) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index 7b3ba81441..dadd4e6ca0 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -46,6 +46,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVdN8v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVdN8v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVdN8v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVdN8v_acoshf) +VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVdN8v_erff) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index a13d2e4ca1..7b2d583e54 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -43,6 +43,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log2f), _ZGVcN8v_log2f) VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVcN8v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVcN8v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVcN8v_acoshf) +VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVcN8v_erff) #define VEC_INT_TYPE __m128i From patchwork Tue Dec 28 20:11:29 2021 X-Patchwork-Submitter: Sunil Pandey X-Patchwork-Id: 1573828 
To: libc-alpha@sourceware.org Subject: [PATCH v4 17/18] x86-64: Add vector tanh/tanhf implementation to libmvec Date: Tue, 28 Dec 2021 12:11:29 -0800 Message-Id: <20211228201130.737370-18-skpgkp2@gmail.com> In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com> References: <20211228201130.737370-1-skpgkp2@gmail.com> From: Sunil Pandey Cc: andrey.kolesov@intel.com, marius.cornea@intel.com Implement vectorized tanh/tanhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector tanh/tanhf with regenerated ulps. 
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 15 + .../fpu/multiarch/svml_d_tanh2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_tanh2_core.c | 27 + .../fpu/multiarch/svml_d_tanh2_core_sse4.S | 1272 ++++++++++++++++ .../fpu/multiarch/svml_d_tanh4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_tanh4_core.c | 27 + .../fpu/multiarch/svml_d_tanh4_core_avx2.S | 1279 +++++++++++++++++ .../fpu/multiarch/svml_d_tanh8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_tanh8_core.c | 27 + .../fpu/multiarch/svml_d_tanh8_core_avx512.S | 472 ++++++ .../fpu/multiarch/svml_s_tanhf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_tanhf16_core.c | 28 + .../multiarch/svml_s_tanhf16_core_avx512.S | 381 +++++ .../fpu/multiarch/svml_s_tanhf4_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_s_tanhf4_core.c | 28 + .../fpu/multiarch/svml_s_tanhf4_core_sse4.S | 832 +++++++++++ .../fpu/multiarch/svml_s_tanhf8_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_s_tanhf8_core.c | 28 + .../fpu/multiarch/svml_s_tanhf8_core_avx2.S | 844 +++++++++++ sysdeps/x86_64/fpu/svml_d_tanh2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_tanh4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_tanh4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_tanh8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_tanhf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_tanhf4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_tanhf8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_tanhf8_core_avx.S | 25 + .../x86_64/fpu/test-double-libmvec-tanh-avx.c | 1 + .../fpu/test-double-libmvec-tanh-avx2.c | 1 + .../fpu/test-double-libmvec-tanh-avx512f.c | 1 + sysdeps/x86_64/fpu/test-double-libmvec-tanh.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + 
.../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../x86_64/fpu/test-float-libmvec-tanhf-avx.c | 1 + .../fpu/test-float-libmvec-tanhf-avx2.c | 1 + .../fpu/test-float-libmvec-tanhf-avx512f.c | 1 + sysdeps/x86_64/fpu/test-float-libmvec-tanhf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 5647 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_tanh2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_tanh4_core.S create mode 100644 
sysdeps/x86_64/fpu/svml_d_tanh4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_tanh8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_tanhf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_tanhf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_tanhf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_tanhf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-tanh.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-tanhf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 33d480031b..21f1a43232 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -285,4 +285,15 @@ #define __DECL_SIMD_erff32x #define __DECL_SIMD_erff64x #define __DECL_SIMD_erff128x + +#define __DECL_SIMD_tanh +#define __DECL_SIMD_tanhf +#define __DECL_SIMD_tanhl +#define __DECL_SIMD_tanhf16 +#define __DECL_SIMD_tanhf32 +#define __DECL_SIMD_tanhf64 +#define __DECL_SIMD_tanhf128 +#define __DECL_SIMD_tanhf32x +#define __DECL_SIMD_tanhf64x +#define __DECL_SIMD_tanhf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index a5b6c4457f..3d1c2056d5 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -72,7 +72,7 @@ __MATHCALL_VEC (cosh,, (_Mdouble_ __x)); /* Hyperbolic sine of X. */ __MATHCALL_VEC (sinh,, (_Mdouble_ __x)); /* Hyperbolic tangent of X. */ -__MATHCALL (tanh,, (_Mdouble_ __x)); +__MATHCALL_VEC (tanh,, (_Mdouble_ __x)); #ifdef __USE_GNU /* Cosine and sine of X. 
*/ diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index 5525c8a0d6..e178cef683 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -61,6 +61,7 @@ GLIBC_2.35 _ZGVbN2v_log10 F GLIBC_2.35 _ZGVbN2v_log1p F GLIBC_2.35 _ZGVbN2v_log2 F GLIBC_2.35 _ZGVbN2v_sinh F +GLIBC_2.35 _ZGVbN2v_tanh F GLIBC_2.35 _ZGVbN2vv_atan2 F GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F @@ -78,6 +79,7 @@ GLIBC_2.35 _ZGVbN4v_log10f F GLIBC_2.35 _ZGVbN4v_log1pf F GLIBC_2.35 _ZGVbN4v_log2f F GLIBC_2.35 _ZGVbN4v_sinhf F +GLIBC_2.35 _ZGVbN4v_tanhf F GLIBC_2.35 _ZGVbN4vv_atan2f F GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F @@ -95,6 +97,7 @@ GLIBC_2.35 _ZGVcN4v_log10 F GLIBC_2.35 _ZGVcN4v_log1p F GLIBC_2.35 _ZGVcN4v_log2 F GLIBC_2.35 _ZGVcN4v_sinh F +GLIBC_2.35 _ZGVcN4v_tanh F GLIBC_2.35 _ZGVcN4vv_atan2 F GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F @@ -112,6 +115,7 @@ GLIBC_2.35 _ZGVcN8v_log10f F GLIBC_2.35 _ZGVcN8v_log1pf F GLIBC_2.35 _ZGVcN8v_log2f F GLIBC_2.35 _ZGVcN8v_sinhf F +GLIBC_2.35 _ZGVcN8v_tanhf F GLIBC_2.35 _ZGVcN8vv_atan2f F GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F @@ -129,6 +133,7 @@ GLIBC_2.35 _ZGVdN4v_log10 F GLIBC_2.35 _ZGVdN4v_log1p F GLIBC_2.35 _ZGVdN4v_log2 F GLIBC_2.35 _ZGVdN4v_sinh F +GLIBC_2.35 _ZGVdN4v_tanh F GLIBC_2.35 _ZGVdN4vv_atan2 F GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F @@ -146,6 +151,7 @@ GLIBC_2.35 _ZGVdN8v_log10f F GLIBC_2.35 _ZGVdN8v_log1pf F GLIBC_2.35 _ZGVdN8v_log2f F GLIBC_2.35 _ZGVdN8v_sinhf F +GLIBC_2.35 _ZGVdN8v_tanhf F GLIBC_2.35 _ZGVdN8vv_atan2f F GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F @@ -163,6 +169,7 @@ GLIBC_2.35 _ZGVeN16v_log10f F GLIBC_2.35 _ZGVeN16v_log1pf F GLIBC_2.35 _ZGVeN16v_log2f F GLIBC_2.35 _ZGVeN16v_sinhf F +GLIBC_2.35 _ZGVeN16v_tanhf F GLIBC_2.35 _ZGVeN16vv_atan2f F GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F @@ -180,5 
+187,6 @@ GLIBC_2.35 _ZGVeN8v_log10 F GLIBC_2.35 _ZGVeN8v_log1p F GLIBC_2.35 _ZGVeN8v_log2 F GLIBC_2.35 _ZGVeN8v_sinh F +GLIBC_2.35 _ZGVeN8v_tanh F GLIBC_2.35 _ZGVeN8vv_atan2 F GLIBC_2.35 _ZGVeN8vv_hypot F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index ea0deb31c1..3c657f6108 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -126,6 +126,10 @@ # define __DECL_SIMD_erf __DECL_SIMD_x86_64 # undef __DECL_SIMD_erff # define __DECL_SIMD_erff __DECL_SIMD_x86_64 +# undef __DECL_SIMD_tanh +# define __DECL_SIMD_tanh __DECL_SIMD_x86_64 +# undef __DECL_SIMD_tanhf +# define __DECL_SIMD_tanhf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index 42addd9a25..c7f81945fe 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -62,6 +62,8 @@ !GCC$ builtin (acoshf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (erf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (erff) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (tanh) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (tanhf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -109,3 +111,5 @@ !GCC$ builtin (acoshf) attributes simd (notinbranch) if('x32') !GCC$ builtin (erf) attributes simd (notinbranch) if('x32') !GCC$ builtin (erff) attributes simd (notinbranch) if('x32') +!GCC$ builtin (tanh) attributes simd (notinbranch) if('x32') +!GCC$ builtin (tanhf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 2b89a1bba3..26df8d47bf 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -45,6 +45,7 @@ libmvec-funcs = \ sin \ sincos \ sinh \ + tanh \ # 
Define libmvec function for benchtests directory. libmvec-bench-funcs = \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index 2fcdef6944..adcbe0fefb 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -29,6 +29,7 @@ libmvec { _ZGVbN2v_log1p; _ZGVcN4v_log1p; _ZGVdN4v_log1p; _ZGVeN8v_log1p; _ZGVbN2v_log2; _ZGVcN4v_log2; _ZGVdN4v_log2; _ZGVeN8v_log2; _ZGVbN2v_sinh; _ZGVcN4v_sinh; _ZGVdN4v_sinh; _ZGVeN8v_sinh; + _ZGVbN2v_tanh; _ZGVcN4v_tanh; _ZGVdN4v_tanh; _ZGVeN8v_tanh; _ZGVbN2vv_atan2; _ZGVcN4vv_atan2; _ZGVdN4vv_atan2; _ZGVeN8vv_atan2; _ZGVbN2vv_hypot; _ZGVcN4vv_hypot; _ZGVdN4vv_hypot; _ZGVeN8vv_hypot; _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; @@ -46,6 +47,7 @@ libmvec { _ZGVbN4v_log1pf; _ZGVcN8v_log1pf; _ZGVdN8v_log1pf; _ZGVeN16v_log1pf; _ZGVbN4v_log2f; _ZGVcN8v_log2f; _ZGVdN8v_log2f; _ZGVeN16v_log2f; _ZGVbN4v_sinhf; _ZGVcN8v_sinhf; _ZGVdN8v_sinhf; _ZGVeN16v_sinhf; + _ZGVbN4v_tanhf; _ZGVcN8v_tanhf; _ZGVdN8v_tanhf; _ZGVeN16v_tanhf; _ZGVbN4vv_atan2f; _ZGVcN8vv_atan2f; _ZGVdN8vv_atan2f; _ZGVeN16vv_atan2f; _ZGVbN4vv_hypotf; _ZGVcN8vv_hypotf; _ZGVdN8vv_hypotf; _ZGVeN16vv_hypotf; } diff --git a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index 929de0e786..bfaad7acef 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -2067,6 +2067,21 @@ float: 3 float128: 3 ldouble: 4 +Function: "tanh_vlen16": +float: 1 + +Function: "tanh_vlen2": +double: 1 + +Function: "tanh_vlen4": +double: 1 + +Function: "tanh_vlen4_avx2": +double: 1 + +Function: "tanh_vlen8": +double: 1 + Function: "tgamma": double: 9 float: 8 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core-sse2.S new file mode 100644 index 0000000000..35b065fe55 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized tanh, vector length is 2. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVbN2v_tanh _ZGVbN2v_tanh_sse2 +#include "../svml_d_tanh2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core.c new file mode 100644 index 0000000000..d2e63bdc56 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized tanh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#define SYMBOL_NAME _ZGVbN2v_tanh +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_tanh, __GI__ZGVbN2v_tanh, __redirect__ZGVbN2v_tanh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core_sse4.S new file mode 100644 index 0000000000..35bbb5b04c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh2_core_sse4.S @@ -0,0 +1,1272 @@ +/* Function tanh vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * NOTE: Since the hyperbolic tangent function is odd + * (tanh(x) = -tanh(-x)), below algorithm deals with the absolute + * value of the argument |x|: tanh(x) = sign(x) * tanh(|x|) + * + * We use a table lookup method to compute tanh(|x|). + * The basic idea is to split the input range into a number of subintervals + * and to approximate tanh(.) with a polynomial on each of them. 
 + * + * IEEE SPECIAL CONDITIONS: + * x = [+,-]0, r = [+,-]0 + * x = +Inf, r = +1 + * x = -Inf, r = -1 + * x = QNaN, r = QNaN + * x = SNaN, r = QNaN + * + * + * ALGORITHM DETAILS + * We handle special values in a callout function, aside from main path + * computations. "Special" values for this algorithm are: + * INF, NaN, |x| > HUGE_THRESHOLD + * + * + * Main path computations are organized as follows: + * We split the interval [0, SATURATION_THRESHOLD) + * into a number of subintervals. On each subinterval we approximate tanh(.) + * with a minimax polynomial of pre-defined degree. Polynomial coefficients + * are computed beforehand and stored in a table. We also use + * + * y := |x| + B, + * + * where B depends on the subinterval and is used to bring the argument + * closer to zero. + * We also add a large fake interval [SATURATION_THRESHOLD, HUGE_THRESHOLD], + * where 1.0 + 0.0*y + 0.0*y^2 ... coefficients are stored - just to + * preserve the main path computation logic but return 1.0 for all arguments. + * + * Hence the reconstruction looks as follows: + * we extract the proper polynomial and range reduction coefficients + * (Pj and B), corresponding to the subinterval to which |x| belongs, + * and return + * + * r := sign(x) * (P0 + P1 * y + ... + Pn * y^n) + * + * NOTE: we use a multiprecision technique to multiply and sum the first + * K terms of the polynomial. So Pj, j = 0..K are each stored in the + * table as a pair of target-precision numbers (Pj and PLj) to + * achieve wider-than-target precision.
 + * + * + */ + +/* Offsets for data table __svml_dtanh_data_internal + */ +#define _dbP 0 +#define _dbSignMask 7680 +#define _dbAbsMask 7696 +#define _iExpMantMask 7712 +#define _iExpMask 7728 +#define _iMinIdxOfsMask 7744 +#define _iMaxIdxMask 7760 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_tanh_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm13 + movq _iExpMantMask+__svml_dtanh_data_internal(%rip), %xmm14 + lea _dbP+96+__svml_dtanh_data_internal(%rip), %rsi + pshufd $221, %xmm13, %xmm8 + +/* if VMIN, VMAX is defined for I type */ + pxor %xmm10, %xmm10 + movq _iMinIdxOfsMask+__svml_dtanh_data_internal(%rip), %xmm9 + +/* Here huge arguments, INF and NaNs are filtered out to callout. */ + pand %xmm14, %xmm8 + movdqa %xmm8, %xmm11 + psubd %xmm9, %xmm8 + movq _iMaxIdxMask+__svml_dtanh_data_internal(%rip), %xmm5 + movdqa %xmm8, %xmm6 + movdqa %xmm8, %xmm7 + pcmpgtd %xmm5, %xmm6 + pcmpgtd %xmm10, %xmm7 + movdqa %xmm6, %xmm3 + pand %xmm7, %xmm8 + andps %xmm6, %xmm5 + andnps %xmm8, %xmm3 + orps %xmm5, %xmm3 + +/* + * VSHRIMM( I, iIndex, = iIndex, (17 - 4) ); + * VGATHER_MATRIX( L2D, p, TAB._dbP, iIndex, 0, T_ITEM_SIZE, T_ITEM_GRAN, 13, 0, 0 ); + */ + psrld $10, %xmm3 + movd %xmm3, %eax + pshufd $1, %xmm3, %xmm4 + +/* Constant loading */ + movq _iExpMask+__svml_dtanh_data_internal(%rip), %xmm15 + movd %xmm4, %ecx + pcmpgtd %xmm15, %xmm11 + movmskps %xmm11, %edx + movups _dbAbsMask+__svml_dtanh_data_internal(%rip), %xmm0 + movups _dbSignMask+__svml_dtanh_data_internal(%rip), %xmm12 + andps %xmm13, %xmm0 + movslq %eax, %rax + andps %xmm13, %xmm12 + movslq %ecx, %rcx + movups %xmm13, (%rsp) + movups -96(%rax,%rsi), %xmm11 + movups -96(%rcx,%rsi), %xmm2 + movups -80(%rax,%rsi), %xmm9 + movups -48(%rax,%rsi), %xmm5 + movaps %xmm9, %xmm10 + movups -32(%rax,%rsi), %xmm3 + movaps %xmm5, %xmm6 + movaps %xmm3, %xmm4 + unpckhpd %xmm2, %xmm11 + movups -80(%rcx,%rsi), %xmm13 + movups -48(%rcx,%rsi), %xmm15 + movups -32(%rcx,%rsi),
%xmm1 + movups -64(%rax,%rsi), %xmm7 + movups -16(%rax,%rsi), %xmm2 + movaps %xmm7, %xmm8 + unpcklpd %xmm13, %xmm10 + unpckhpd %xmm13, %xmm9 + movups -64(%rcx,%rsi), %xmm14 + movups -16(%rcx,%rsi), %xmm13 + unpcklpd %xmm15, %xmm6 + unpckhpd %xmm15, %xmm5 + unpcklpd %xmm1, %xmm4 + unpckhpd %xmm1, %xmm3 + movaps %xmm2, %xmm1 + movups (%rax,%rsi), %xmm15 + unpcklpd %xmm14, %xmm8 + unpckhpd %xmm14, %xmm7 + unpcklpd %xmm13, %xmm1 + unpckhpd %xmm13, %xmm2 + movaps %xmm15, %xmm13 + movups (%rcx,%rsi), %xmm14 + unpcklpd %xmm14, %xmm13 + addpd %xmm13, %xmm0 + mulpd %xmm0, %xmm2 + addpd %xmm1, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm3, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm4, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm5, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm6, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm7, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm8, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm9, %xmm2 + mulpd %xmm0, %xmm2 + addpd %xmm10, %xmm2 + mulpd %xmm2, %xmm0 + addpd %xmm11, %xmm0 + orps %xmm12, %xmm0 + andl $3, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups (%rsp), %xmm1 + movups %xmm1, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range mask */ + jl 
L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call tanh@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN2v_tanh_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dtanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _dbP[60*16][2]; + __declspec(align(16)) VUINT32 _dbSignMask[2][2]; + __declspec(align(16)) VUINT32 _dbAbsMask[2][2]; + __declspec(align(16)) VUINT32 _iExpMantMask[4][1]; + __declspec(align(16)) VUINT32 _iExpMask[4][1]; + __declspec(align(16)) VUINT32 _iMinIdxOfsMask[4][1]; + __declspec(align(16)) VUINT32 _iMaxIdxMask[4][1]; +} __svml_dtanh_data_internal; +#endif +__svml_dtanh_data_internal: + /* Polynomial coefficients */ + .quad 0x0000000000000000 /* PL0 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* PH0 = +0.000000000000000000000e-01 */ + .quad 0x3FF0000000000000 /* P1 = +1.000000000000000014103e+00 */ + .quad 0xBD197DEAD79668D3 /* P2 = -2.264132406596103056796e-14 */ + .quad 0xBFD555555553AF3C /* P3 = -3.333333333273349741024e-01 */ + .quad 0xBE052F7CCA134846 /* P4 = -6.165791385711493738399e-10 */ + .quad 0x3FC11111563849D6 /* P5 = +1.333333655353061107201e-01 */ + .quad 0xBEB038623673FFB2 /* P6 = -9.668021563879858950855e-07 */ + .quad 0xBFAB9F685E64022E /* P7 = -5.395055916051593179252e-02 */ + .quad 0xBF2A54E2B28F2207 /* P8 = -2.008940439550829012647e-04 */ + .quad 0x3F97CFB9328A230E /* P9 =
+2.325333949059698582189e-02 */ + .quad 0xBF75CA6D61723E02 /* P10 = -5.320002811586290441790e-03 */ + .quad 0x0000000000000000 /* B = +0 */ + .quad 0x3FF0000000000000 /* A = +1.0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C3708A564FAD29A /* PL0 = +1.248663375337163807466e-18 */ + .quad 0x3FC0E6973998DA48 /* PH0 = +1.320370703922029154143e-01 */ + .quad 0x3FEF712EB25C0888 /* P1 = +9.825662120422444519229e-01 */ + .quad 0xBFC09B296F7C1EA9 /* P2 = -1.297351641044220078331e-01 */ + .quad 0xBFD3DD77541EDDA7 /* P3 = -3.103922196855485849143e-01 */ + .quad 0x3FB58FFCF4309615 /* P4 = +8.422833406128689275566e-02 */ + .quad 0x3FBD3ABE845DCF49 /* P5 = +1.141776154670967208833e-01 */ + .quad 0xBFA791DF538C37FA /* P6 = -4.603479285115947936529e-02 */ + .quad 0xBFA4F872F69CD6E8 /* P7 = -4.095801601799370195284e-02 */ + .quad 0x3F9772E49EF6412B /* P8 = +2.289921970583567527179e-02 */ + .quad 0x3F8CBC0807393909 /* P9 = +1.403051635784581776625e-02 */ + .quad 0xBF85F06A30F93319 /* P10 = -1.071246110873285040939e-02 */ + .quad 0xBFC1000000000000 /* B = -.132813 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6004EE5739DEAC /* PL0 = +6.947247374112211856530e-18 */ + .quad 0x3FC2DC968E6E0D62 /* PH0 = +1.473568149050193398786e-01 */ + .quad 0x3FEF4E1E606D96DF /* P1 = +9.782859691010478680677e-01 */ + .quad 0xBFC273BD70994AB9 /* P2 = -1.441571044730005866646e-01 */ + .quad 0xBFD382B548270D2C /* P3 = -3.048527912726111386771e-01 */ + .quad 0x3FB7CD2D582A6B29 /* P4 = +9.297450449450351894400e-02 */ + .quad 0x3FBC1278CCCBF0DB /* P5 = +1.096568584434324642303e-01 */ + .quad 0xBFA9C7F5115B86A1 /* P6 = -5.035367810138536095866e-02 */ + .quad 0xBFA371C21BAF618E /* P7 = -3.797728145554222910481e-02 */ + .quad 0x3F9958943F68417E /* P8 = +2.475196492201935923783e-02 */ + .quad 0x3F8930D5CFFD4152 /* P9 = 
+1.230017701132682667572e-02 */ + .quad 0xBF875CF7ADD31B76 /* P10 = -1.140779017658897660092e-02 */ + .quad 0xBFC3000000000000 /* B = -.148438 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7EABE24E052A1F /* PL0 = +2.660321779421749543501e-17 */ + .quad 0x3FC4D04783618C71 /* PH0 = +1.626061812886266111366e-01 */ + .quad 0x3FEF2765AF97A4B3 /* P1 = +9.735592298067302883212e-01 */ + .quad 0xBFC443654205FEA5 /* P2 = -1.583067486171689074207e-01 */ + .quad 0xBFD31F2E208A5B97 /* P3 = -2.987780874040536844467e-01 */ + .quad 0x3FB9F235BD339878 /* P4 = +1.013520800512156573576e-01 */ + .quad 0x3FBAD0B0DFCCA141 /* P5 = +1.047468706498238100104e-01 */ + .quad 0xBFABD1B9600E608E /* P6 = -5.433444306908184548967e-02 */ + .quad 0xBFA1CEBEAF07DB58 /* P7 = -3.478046309094534453598e-02 */ + .quad 0x3F9AFC9FB1D8EFD2 /* P8 = +2.635430834764902126383e-02 */ + .quad 0x3F8573444F1AB502 /* P9 = +1.047376028449287564018e-02 */ + .quad 0xBF8874FBC8F24406 /* P10 = -1.194187838544459322219e-02 */ + .quad 0xBFC5000000000000 /* B = -.164063 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7FB199D361A790 /* PL0 = +2.748994907060158996213e-17 */ + .quad 0x3FC6C170259E21F7 /* PH0 = +1.777782615356639783766e-01 */ + .quad 0x3FEEFD17479F7C65 /* P1 = +9.683948897253570478266e-01 */ + .quad 0xBFC609530FE4DF8D /* P2 = -1.721595599753950294577e-01 */ + .quad 0xBFD2B3465D71B4DE /* P3 = -2.921920692959484052676e-01 */ + .quad 0x3FBBFD2D34AC509B /* P4 = +1.093319181057403192166e-01 */ + .quad 0x3FB9778C3C16A0FE /* P5 = +9.948040453912551395183e-02 */ + .quad 0xBFADAC4D9E63C665 /* P6 = -5.795519407719210697372e-02 */ + .quad 0xBFA0139CCAD02D60 /* P7 = -3.139963126894929339124e-02 */ + .quad 0x3F9C5BF43BA6F19D /* P8 = +2.769452680671379432854e-02 */ + .quad 0x3F8190B703350341 /* P9 = 
+8.576803002712575184772e-03 */ + .quad 0xBF8936606782858A /* P10 = -1.231074634444230850234e-02 */ + .quad 0xBFC7000000000000 /* B = -.179688 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6A917CA3624D50 /* PL0 = +1.152216693509785660691e-17 */ + .quad 0x3FC8AFD7B974FABB /* PH0 = +1.928662925292508878439e-01 */ + .quad 0x3FEECF47624A5D03 /* P1 = +9.628025932060214187231e-01 */ + .quad 0xBFC7C4C2CB4FDE4D /* P2 = -1.856921665891938814679e-01 */ + .quad 0xBFD23F69CB2C1F9D /* P3 = -2.851204380135586155453e-01 */ + .quad 0x3FBDEC5703A03814 /* P4 = +1.168875106670557712458e-01 */ + .quad 0x3FB8095003D0CF15 /* P5 = +9.389209836154706616487e-02 */ + .quad 0xBFAF554B47B10CBB /* P6 = -6.119761705533607365968e-02 */ + .quad 0xBF9C89743FE7BC1B /* P7 = -2.786809577986213853937e-02 */ + .quad 0x3F9D74725B746E7C /* P8 = +2.876452143855921824991e-02 */ + .quad 0x3F7B2D8AFB70B88C /* P9 = +6.635229968237631511880e-03 */ + .quad 0xBF89A0A2883EF6CB /* P10 = -1.251341799058582545252e-02 */ + .quad 0xBFC9000000000000 /* B = -.195313 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7608279E8609CB /* PL0 = +1.910958764623660748269e-17 */ + .quad 0x3FCA9B46D2DDC5E3 /* PH0 = +2.078636674519166172015e-01 */ + .quad 0x3FEE9E0BB72A01A1 /* P1 = +9.567926957534390123919e-01 */ + .quad 0xBFC974FAD10C5330 /* P2 = -1.988824387305156976885e-01 */ + .quad 0xBFD1C40ACCBA4044 /* P3 = -2.775904654781735703430e-01 */ + .quad 0x3FBFBE24E2987853 /* P4 = +1.239951184474830487522e-01 */ + .quad 0x3FB6885B4345E47F /* P5 = +8.801813499839460539687e-02 */ + .quad 0xBFB06563D5670584 /* P6 = -6.404708824176991770896e-02 */ + .quad 0xBF98CD1D620DF6E2 /* P7 = -2.421995078065365147772e-02 */ + .quad 0x3F9E44EF3E844D21 /* P8 = +2.955983943054463683119e-02 */ + .quad 0x3F7325FA0148CAAE /* P9 = 
+4.674889165971292322643e-03 */ + .quad 0xBF89B4C8556C2D92 /* P10 = -1.255184660614964011319e-02 */ + .quad 0xBFCB000000000000 /* B = -.210938 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6F19DAA20F51D5 /* PL0 = +1.348790537832000351176e-17 */ + .quad 0x3FCC83876CA98E15 /* PH0 = +2.227639465883021474557e-01 */ + .quad 0x3FEE697B662D07CD /* P1 = +9.503762241004040620296e-01 */ + .quad 0xBFCB194C7ED76ACF /* P2 = -2.117095584242946953999e-01 */ + .quad 0xBFD141A19E419762 /* P3 = -2.696308179350720680191e-01 */ + .quad 0x3FC0B89C64BC7B98 /* P4 = +1.306338779331468503007e-01 */ + .quad 0x3FB4F721150BBFC5 /* P5 = +8.189589275184434216748e-02 */ + .quad 0xBFB105AAFAB87898 /* P6 = -6.649273511036069461061e-02 */ + .quad 0xBF94FB3B31248C01 /* P7 = -2.048962104266749732921e-02 */ + .quad 0x3F9ECD31E588709C /* P8 = +3.007963145692880855964e-02 */ + .quad 0x3F664A91A335C105 /* P9 = +2.721104095762541127495e-03 */ + .quad 0xBF89754E32E1E26E /* P10 = -1.243077366619723806134e-02 */ + .quad 0xBFCD000000000000 /* B = -.226563 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6AC6C889D8111D /* PL0 = +1.161245469312620769170e-17 */ + .quad 0x3FCE6864FE55A3D0 /* PH0 = +2.375608674877001114112e-01 */ + .quad 0x3FEE31AEE116B82B /* P1 = +9.435648342384913826391e-01 */ + .quad 0xBFCCB114B69E808B /* P2 = -2.241540805525839833707e-01 */ + .quad 0xBFD0B8AB913BA99D /* P3 = -2.612713735858507980441e-01 */ + .quad 0x3FC1823322BED48A /* P4 = +1.367858810096190233514e-01 */ + .quad 0x3FB35822B7929893 /* P5 = +7.556359273675842651653e-02 */ + .quad 0xBFB18B03CC78D2DA /* P6 = -6.852744810096158580830e-02 */ + .quad 0xBF911CCC3C8D5E5D /* P7 = -1.671141738492420009734e-02 */ + .quad 0x3F9F0DEC2D99B12F /* P8 = +3.032654789278515819797e-02 */ + .quad 0x3F4A28398B4EBD98 /* P9 = 
+7.982521989244205404918e-04 */ + .quad 0xBF88E60CB2FAB9A4 /* P10 = -1.215753480150000985458e-02 */ + .quad 0xBFCF000000000000 /* B = -.242188 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C89D2B6774FB61D /* PL0 = +4.479593208720169247958e-17 */ + .quad 0x3FD09C744F539BE4 /* PH0 = +2.595492148088267558848e-01 */ + .quad 0x3FEDD823B0400D42 /* P1 = +9.326342050921214825882e-01 */ + .quad 0xBFCEFBF7FF305FCC /* P2 = -2.420644756355144687086e-01 */ + .quad 0xBFCFC01DC4F24A41 /* P3 = -2.480504237797323303990e-01 */ + .quad 0x3FC291A2C26D5548 /* P4 = +1.450694512701977626753e-01 */ + .quad 0x3FB0D562E672D188 /* P5 = +6.575601698097532991976e-02 */ + .quad 0xBFB2201ECC119E06 /* P6 = -7.080261690281738261872e-02 */ + .quad 0xBF8695D50F778D31 /* P7 = -1.102796987010509974642e-02 */ + .quad 0x3F9EEC8CFBC031A0 /* P8 = +3.019924437107734972427e-02 */ + .quad 0xBF6030F0A4D3660A /* P9 = -1.976461417694923328722e-03 */ + .quad 0xBF87845288A4AEF5 /* P10 = -1.148285369398347838494e-02 */ + .quad 0xBFD1000000000000 /* B = -.265625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8B6AAB614D1C8D /* PL0 = +4.756035418366735312727e-17 */ + .quad 0x3FD275F7E1CF7F63 /* PH0 = +2.884502129727392616410e-01 */ + .quad 0x3FED56658F74C9CC /* P1 = +9.167964746359813351341e-01 */ + .quad 0xBFD0ECC045EBD596 /* P2 = -2.644501383614054083635e-01 */ + .quad 0xBFCD5A4BDE179180 /* P3 = -2.293181261476426808811e-01 */ + .quad 0x3FC3C00047D34767 /* P4 = +1.542969084462655120552e-01 */ + .quad 0x3FAAC7CE84FD609F /* P5 = +5.230565427217581251974e-02 */ + .quad 0xBFB288948D2E8B43 /* P6 = -7.239654967137902384931e-02 */ + .quad 0xBF6D6605AAD5A1C0 /* P7 = -3.588687008847041164896e-03 */ + .quad 0x3F9DDB0790848E97 /* P8 = +2.915584392134337382866e-02 */ + .quad 0xBF75FDE291BAD5B4 /* P9 = 
-5.369076763306269573660e-03 */ + .quad 0xBF84CEA5C52E0A78 /* P10 = -1.015977390284671071888e-02 */ + .quad 0xBFD3000000000000 /* B = -.296875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7139A81C8A6ECF /* PL0 = +1.494049799478574591322e-17 */ + .quad 0x3FD4470650036407 /* PH0 = +3.168350011233659890841e-01 */ + .quad 0x3FECC9A69DFDDD48 /* P1 = +8.996155820631566629678e-01 */ + .quad 0xBFD23DED3A37A09F /* P2 = -2.850297039535778028925e-01 */ + .quad 0xBFCAD302395D51C1 /* P3 = -2.095644741153943890185e-01 */ + .quad 0x3FC4A8FE3F309C22 /* P4 = +1.614072617096278705115e-01 */ + .quad 0x3FA3D161188AA436 /* P5 = +3.870681213931741151586e-02 */ + .quad 0xBFB288CFE5494E98 /* P6 = -7.240008685885823969403e-02 */ + .quad 0x3F6C7903EED8D334 /* P7 = +3.475673371918475361081e-03 */ + .quad 0x3F9BE023CDFB02F6 /* P8 = +2.722221321778569498033e-02 */ + .quad 0xBF80F8296F2C3A95 /* P9 = -8.285831170295390358336e-03 */ + .quad 0xBF8152DF4790049B /* P10 = -8.458847400108650973189e-03 */ + .quad 0xBFD5000000000000 /* B = -.328125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7751FE0FEE8335 /* PL0 = +2.022712113430213599928e-17 */ + .quad 0x3FD60EF7120502A9 /* PH0 = +3.446633983585721261456e-01 */ + .quad 0x3FEC32D951E56E6F /* P1 = +8.812071418319202070776e-01 */ + .quad 0xBFD370255FC004F8 /* P2 = -3.037198481616338996824e-01 */ + .quad 0xBFC832F0EBC6BB41 /* P3 = -1.890545989276351359107e-01 */ + .quad 0x3FC54C99A0FF432F /* P4 = +1.664001499289269127540e-01 */ + .quad 0x3F99DAC0CC283C18 /* P5 = +2.524853941036661688369e-02 */ + .quad 0xBFB227B3896A026D /* P6 = -7.091829399906553280461e-02 */ + .quad 0x3F84663364E1FB19 /* P7 = +9.960557476231411602383e-03 */ + .quad 0x3F9922D70DE07C57 /* P8 = +2.454696676442965935283e-02 */ + .quad 0xBF85C4A4EB6F86BC /* P9 = 
-1.062897532932837635222e-02 */ + .quad 0xBF7AAB61214FFE17 /* P10 = -6.511096396024671890972e-03 */ + .quad 0xBFD7000000000000 /* B = -.359375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3BFE67F266843B2C /* PL0 = +1.030196791298162288777e-19 */ + .quad 0x3FD7CD3115FC0F16 /* PH0 = +3.718989100163850869407e-01 */ + .quad 0x3FEB92F96CCC2C5B /* P1 = +8.616912007286247079761e-01 */ + .quad 0xBFD4827320135092 /* P2 = -3.204620183216856200247e-01 */ + .quad 0xBFC582B15550168A /* P3 = -1.680509249273891977521e-01 */ + .quad 0x3FC5AC3B9A2E4C31 /* P4 = +1.693186285816366254244e-01 */ + .quad 0x3F88FA599FCADAFB /* P5 = +1.219625491044728129762e-02 */ + .quad 0xBFB16EC8F5CA169E /* P6 = -6.809669495313605642174e-02 */ + .quad 0x3F90140EFC748BBE /* P7 = +1.570151725639922719844e-02 */ + .quad 0x3F95CFC49C1A28DC /* P8 = +2.130038454792147768770e-02 */ + .quad 0xBF8946ED8B1BF454 /* P9 = -1.234231549050882816697e-02 */ + .quad 0xBF7239E55C1DD50F /* P10 = -4.449745117985472755606e-03 */ + .quad 0xBFD9000000000000 /* B = -.390625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6412330191189C /* PL0 = +8.704448096175471149661e-18 */ + .quad 0x3FD9812B3B03F0A5 /* PH0 = +3.985088421175169703936e-01 */ + .quad 0x3FEAEB08C3C0E84D /* P1 = +8.411907027541559254748e-01 */ + .quad 0xBFD57446B1BC46CF /* P2 = -3.352219329545790787820e-01 */ + .quad 0xBFC2CA9ABC0444AD /* P3 = -1.468079965639267634401e-01 */ + .quad 0x3FC5CA95F9460D18 /* P4 = +1.702449290424759093710e-01 */ + .quad 0xBF2C2DAA35DD05C3 /* P5 = -2.149839664813813012186e-04 */ + .quad 0xBFB069A516EEB75D /* P6 = -6.411201295733578195472e-02 */ + .quad 0x3F9512716416FDC7 /* P7 = +2.057816670798986720058e-02 */ + .quad 0x3F921630CB1319A3 /* P8 = +1.766277541607908852593e-02 */ + .quad 0xBF8B76DA2EC99526 /* P9 = 
-1.341028647693549562145e-02 */ + .quad 0xBF63A97474A161E4 /* P10 = -2.400138332671485493040e-03 */ + .quad 0xBFDB000000000000 /* B = -.421875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C89B79F5783381C /* PL0 = +4.461236087774530799537e-17 */ + .quad 0x3FDB2A6C993B829D /* PH0 = +4.244643684778937609003e-01 */ + .quad 0x3FEA3C0C1FBA328C /* P1 = +8.198299998926627915155e-01 */ + .quad 0xBFD6457212F78DE0 /* P2 = -3.479886231636708581604e-01 */ + .quad 0xBFC0129BDA380A66 /* P3 = -1.255678954622282824818e-01 */ + .quad 0x3FC5AB77F388FBDE /* P4 = +1.692953051696965507089e-01 */ + .quad 0xBF8822F3A6CADB7C /* P5 = -1.178541519889874597783e-02 */ + .quad 0xBFAE4A876370A4BD /* P6 = -5.916236008517603590739e-02 */ + .quad 0x3F991A89BC3B7710 /* P7 = +2.451529704455085335710e-02 */ + .quad 0x3F8C4A4328204D4B /* P8 = +1.381351915555364098800e-02 */ + .quad 0xBF8C5F921D01EC0B /* P9 = -1.385416174911393178490e-02 */ + .quad 0xBF3EE844C5B79FB8 /* P10 = -4.716079617694784908234e-04 */ + .quad 0xBFDD000000000000 /* B = -.453125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C73FA437AD7AD87 /* PL0 = +1.732779905745858845932e-17 */ + .quad 0x3FDCC88C9902CF45 /* PH0 = +4.497405523536495697279e-01 */ + .quad 0x3FE9870845162D1D /* P1 = +7.977334355686341748810e-01 */ + .quad 0xBFD6F62358F73DA8 /* P2 = -3.587730759436120677668e-01 */ + .quad 0xBFBAC4345D675FE1 /* P3 = -1.045563438450467661101e-01 */ + .quad 0x3FC5539DA8287019 /* P4 = +1.666142531474868131862e-01 */ + .quad 0xBF96E3E0DC04A09F /* P5 = -2.235366194614185212822e-02 */ + .quad 0xBFAB5EC7147C207D /* P6 = -5.345747113284546871398e-02 */ + .quad 0x3F9C24166FFA7A58 /* P7 = +2.748141344511120915667e-02 */ + .quad 0x3F8451B907819844 /* P8 = +9.921498815128277696693e-03 */ + .quad 0xBF8C1C6D19191FCB /* P9 = 
-1.372609360545586670239e-02 */ + .quad 0x3F547372DF72E35A /* P10 = +1.248228245272117756098e-03 */ + .quad 0xBFDF000000000000 /* B = -.484375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C848FE06EE49950 /* PL0 = +3.566941590788961528958e-17 */ + .quad 0x3FDF20211A36475D /* PH0 = +4.863360172249622803697e-01 */ + .quad 0x3FE86E67E6B80AC2 /* P1 = +7.634772783497611574659e-01 */ + .quad 0xBFD7C37C55474D9B /* P2 = -3.713064987943767913461e-01 */ + .quad 0xBFB2EBF15F3CB036 /* P3 = -7.391270232318521952684e-02 */ + .quad 0x3FC4718C8EF6E3AA /* P4 = +1.597152422016539530950e-01 */ + .quad 0xBFA277F8394E9B07 /* P5 = -3.607154559658991932071e-02 */ + .quad 0xBFA680312AB207E3 /* P6 = -4.394677778419955009224e-02 */ + .quad 0x3F9EDC9A8B57E286 /* P7 = +3.013841128810892143223e-02 */ + .quad 0x3F71B8C5E648EAF6 /* P8 = +4.326603932492947851719e-03 */ + .quad 0xBF89DB218356730C /* P9 = -1.262499029217558458029e-02 */ + .quad 0x3F6B05728E6EBC8E /* P10 = +3.298496001171330815865e-03 */ + .quad 0xBFE1000000000000 /* B = -.53125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8429831EDD94DE /* PL0 = +3.497576705878673192147e-17 */ + .quad 0x3FE10AF47E0BF610 /* PH0 = +5.325872861719194162333e-01 */ + .quad 0x3FE6EC5879F87EEE /* P1 = +7.163507826080299761242e-01 */ + .quad 0xBFD86AD001BFE200 /* P2 = -3.815193192563413204129e-01 */ + .quad 0xBFA239045B661385 /* P3 = -3.559125533778398983564e-02 */ + .quad 0x3FC2B4572D9CC147 /* P4 = +1.461285565105845078038e-01 */ + .quad 0xBFA99F4F01740705 /* P5 = -5.004355328311586406115e-02 */ + .quad 0xBF9F449C484F4879 /* P6 = -3.053516570418721511214e-02 */ + .quad 0x3F9F5F42169D7DDE /* P7 = +3.063681853325116830798e-02 */ + .quad 0xBF6111B1BA632A97 /* P8 = -2.083632588527460989469e-03 */ + .quad 0xBF84725FBE5B6E61 /* P9 = 
-9.983776089419639342530e-03 */ + .quad 0x3F7438A2986CFA9C /* P10 = +4.936823976832951342488e-03 */ + .quad 0xBFE3000000000000 /* B = -.59375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6BE9160BFB3505 /* PL0 = +1.210424670976053242391e-17 */ + .quad 0x3FE26D76F73233C7 /* PH0 = +5.758623912857893101247e-01 */ + .quad 0x3FE56363B5B93937 /* P1 = +6.683825063026124740752e-01 */ + .quad 0xBFD8A2244B27297E /* P2 = -3.848963483730115724200e-01 */ + .quad 0xBF52CA2F101EEF63 /* P3 = -1.146837196286797844817e-03 */ + .quad 0x3FC081BC342243AD /* P4 = +1.289592032012739958675e-01 */ + .quad 0xBFAE38DB4A932344 /* P5 = -5.902753148399722719732e-02 */ + .quad 0xBF91F814D4AE90C6 /* P6 = -1.754791782481459457885e-02 */ + .quad 0x3F9D056AE193C4F3 /* P7 = +2.834097863973723355792e-02 */ + .quad 0xBF7BD0B502D8F3A0 /* P8 = -6.790835451792626336974e-03 */ + .quad 0xBF7B763F7BB8AE2F /* P9 = -6.704566938008179114124e-03 */ + .quad 0x3F76036F42D9AB69 /* P10 = +5.374369252971835729099e-03 */ + .quad 0xBFE5000000000000 /* B = -.65625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8B64AF0450486E /* PL0 = +4.751979286662385162741e-17 */ + .quad 0x3FE3B75F8BCB742D /* PH0 = +6.161344271055263499548e-01 */ + .quad 0x3FE3DA23BC12369F /* P1 = +6.203783677353447780947e-01 */ + .quad 0xBFD8768FF4B46416 /* P2 = -3.822364701932782367281e-01 */ + .quad 0x3F9D67CB8AD9CB1A /* P3 = +2.871625933625941117406e-02 */ + .quad 0x3FBC168CB7827DF4 /* P4 = +1.097190807363331305006e-01 */ + .quad 0xBFB03A2B83C9272E /* P5 = -6.338760344911228324430e-02 */ + .quad 0xBF789FEB595297DC /* P6 = -6.011885959344067548074e-03 */ + .quad 0x3F98BD01B4C335E7 /* P7 = +2.415850320612902513532e-02 */ + .quad 0xBF83BADC303D6535 /* P8 = -9.633751127398152979976e-03 */ + .quad 0xBF6C54E7A1C1E3F3 /* P9 = 
-3.458454519258407989501e-03 */ + .quad 0x3F7408394B7EF3E7 /* P10 = +4.890655334688332484537e-03 */ + .quad 0xBFE7000000000000 /* B = -.71875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6A48557F6E0D3E /* PL0 = +1.139824111505584215867e-17 */ + .quad 0x3FE4E8D895B010DC /* PH0 = +6.534235881413468227663e-01 */ + .quad 0x3FE25652FAAF8A73 /* P1 = +5.730376144604875448991e-01 */ + .quad 0xBFD7F6C3A57C444B /* P2 = -3.744362941807295084434e-01 */ + .quad 0x3FAB7866E3F99EBE /* P3 = +5.365296872042567001598e-02 */ + .quad 0x3FB6FA1DF47CCD40 /* P4 = +8.975398272450707099784e-02 */ + .quad 0xBFB05508D3741B8E /* P5 = -6.379752314033580026840e-02 */ + .quad 0x3F6C3EFDF7BB279C /* P6 = +3.448005705512137236209e-03 */ + .quad 0x3F9372BADD6D3E27 /* P7 = +1.899234749299530050806e-02 */ + .quad 0xBF860FD5AE65F3DA /* P8 = -1.077238977881649471165e-02 */ + .quad 0xBF47266FFB07E628 /* P9 = -7.064863949032872448118e-04 */ + .quad 0x3F6F9763992C2A05 /* P10 = +3.856367614735181120799e-03 */ + .quad 0xBFE9000000000000 /* B = -.78125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6BB6A2B194E3AB /* PL0 = +1.201878007209462528697e-17 */ + .quad 0x3FE602609AAE7C22 /* PH0 = +6.877902051090851731630e-01 */ + .quad 0x3FE0DCBAFE191C7F /* P1 = +5.269446337560025312137e-01 */ + .quad 0xBFD732028428A9FB /* P2 = -3.624273577321727538225e-01 */ + .quad 0x3FB2D92389BE065B /* P3 = +7.362577545975439796588e-02 */ + .quad 0x3FB1F6A9C8C49993 /* P4 = +7.017003203927733370937e-02 */ + .quad 0xBFAF47C0B50B56EE /* P5 = -6.109430513394707378526e-02 */ + .quad 0x3F85A8EDD1356223 /* P6 = +1.057611269668352068104e-02 */ + .quad 0x3F8BE05C5CD1B4FA /* P7 = +1.361152799855823798207e-02 */ + .quad 0xBF85A0EFE4552F76 /* P8 = -1.056086936537046752272e-02 */ + .quad 0x3F559F2A6A356194 /* P9 = 
+1.319686337259627831943e-03 */ + .quad 0x3F6576F5E989208D /* P10 = +2.620201394425042596201e-03 */ + .quad 0xBFEB000000000000 /* B = -.84375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C80328BD86C8B74 /* PL0 = +2.809809047161267929701e-17 */ + .quad 0x3FE704BB1B7FCB81 /* PH0 = +7.193275010198335595035e-01 */ + .quad 0x3FDEE264AAD6C40C /* P1 = +4.825679462765613089739e-01 */ + .quad 0xBFD637493CE659F1 /* P2 = -3.471243948673921548357e-01 */ + .quad 0x3FB6BE3A3DEE6F4A /* P3 = +8.884014141079635303208e-02 */ + .quad 0x3FAA85EB6470AC0F /* P4 = +5.180297471118688523488e-02 */ + .quad 0xBFACC0146EA4858D /* P5 = -5.615295267694895314457e-02 */ + .quad 0x3F8F8FB683CDDAC5 /* P6 = +1.541082944616557159055e-02 */ + .quad 0x3F819515DEE2CB91 /* P7 = +8.585139145315585602547e-03 */ + .quad 0xBF834E45E6AF9EA1 /* P8 = -9.426637747267209169415e-03 */ + .quad 0x3F65250F197CA56D /* P9 = +2.581147662472352252568e-03 */ + .quad 0x3F57A766026D036C /* P10 = +1.443719500187702367690e-03 */ + .quad 0xBFED000000000000 /* B = -.90625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C716F7EEF7B61AD /* PL0 = +1.512291215142578135651e-17 */ + .quad 0x3FE7F0E1A4CD846E /* PH0 = +7.481544703297353660076e-01 */ + .quad 0x3FDC2D4CC872DC09 /* P1 = +4.402648885256331012598e-01 */ + .quad 0xBFD514A99F92ED53 /* P2 = -3.293861444796750250530e-01 */ + .quad 0x3FB9846A6CF2F337 /* P3 = +9.967675361526749494844e-02 */ + .quad 0x3FA20896939AB161 /* P4 = +3.522177268800664413493e-02 */ + .quad 0xBFA97E801F31EE0D /* P5 = -4.979324703978358553405e-02 */ + .quad 0x3F92A11F47B82085 /* P6 = +1.819275737037219740638e-02 */ + .quad 0x3F717D70FE289C34 /* P7 = +4.270020845559097605514e-03 */ + .quad 0xBF7FDCF1D3F6CE2D /* P8 = -7.779068604054678540132e-03 */ + .quad 0x3F69F607E81AF6B6 /* P9 = 
+3.169074480722534625181e-03 */ + .quad 0x3F3F925C80D0F889 /* P10 = +4.817462766516585511824e-04 */ + .quad 0xBFEF000000000000 /* B = -.96875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C931A11D7E8606E /* PL0 = +6.627280241435322692188e-17 */ + .quad 0x3FE92BFB370D9B71 /* PH0 = +7.866188121086975515439e-01 */ + .quad 0x3FD866160E454111 /* P1 = +3.812308444367014680480e-01 */ + .quad 0xBFD33149F3801DBA /* P2 = -2.998833539899937679796e-01 */ + .quad 0x3FBBDB6D4C949899 /* P3 = +1.088169395412442909023e-01 */ + .quad 0x3F8D6AB2A74B9343 /* P4 = +1.436366627735597372494e-02 */ + .quad 0xBFA404D1047C5D72 /* P5 = -3.909924678571997970917e-02 */ + .quad 0x3F93C47D9ACCD919 /* P6 = +1.930423981976856424661e-02 */ + .quad 0xBF41B755642CFF1B /* P7 = -5.406538915408738478158e-04 */ + .quad 0xBF74B5301AA1E788 /* P8 = -5.055606752756853900641e-03 */ + .quad 0x3F69A84C5B2A3E68 /* P9 = +3.132008679422249529120e-03 */ + .quad 0xBF3CF47830328C11 /* P10 = -4.418176105877589308931e-04 */ + .quad 0xBFF1000000000000 /* B = -1.0625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C884D471B8FD396 /* PL0 = +4.215701792312937090514e-17 */ + .quad 0x3FEA8DBCBC31897A /* PH0 = +8.298019099859594849278e-01 */ + .quad 0x3FD3EE730537C8EA /* P1 = +3.114287901836535219818e-01 */ + .quad 0xBFD08A05AD27CE32 /* P2 = -2.584242049190123217982e-01 */ + .quad 0x3FBC5255406F84B6 /* P3 = +1.106313021005175045399e-01 */ + .quad 0xBF772FA2F633AA5E /* P4 = -5.660664147607434209241e-03 */ + .quad 0xBF99DD8E4C473FC4 /* P5 = -2.525923100057504533247e-02 */ + .quad 0x3F9183C935B6495D /* P6 = +1.710428610165003372069e-02 */ + .quad 0xBF70471A3A591480 /* P7 = -3.974058583087303228038e-03 */ + .quad 0xBF603DDD4DEBB9A4 /* P8 = -1.982624278176818987264e-03 */ + .quad 0x3F62591E44D3C17F /* P9 = 
+2.239760512218135956425e-03 */ + .quad 0xBF4C195D3A9B1AB4 /* P10 = -8.575158328419569430544e-04 */ + .quad 0xBFF3000000000000 /* B = -1.1875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C90DD1C9BFF7F64 /* PL0 = +5.850777430004479798187e-17 */ + .quad 0x3FEBAD50A4A68BC1 /* PH0 = +8.649066177207417327466e-01 */ + .quad 0x3FD01FBA72CEE1A5 /* P1 = +2.519365426228666233893e-01 */ + .quad 0xBFCBE432F647C4D6 /* P2 = -2.179015829602010702633e-01 */ + .quad 0x3FBABF92B6E5AC73 /* P3 = +1.044856735731387955105e-01 */ + .quad 0xBF922983AA24E217 /* P4 = -1.773648954369563555378e-02 */ + .quad 0xBF8C72214C14E23A /* P5 = -1.388956082756564056328e-02 */ + .quad 0x3F8ACB4D1F388E8B /* P6 = +1.308307887581540972153e-02 */ + .quad 0xBF740EF8B4A2EE3B /* P7 = -4.897090441029978580995e-03 */ + .quad 0xBF0EA9F30C8DC900 /* P8 = -5.848668076326342477133e-05 */ + .quad 0x3F53CC40D18713AE /* P9 = +1.208365725788622757410e-03 */ + .quad 0xBF4848B86029CBA1 /* P10 = -7.410908004444779592485e-04 */ + .quad 0xBFF5000000000000 /* B = -1.3125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8FB61781D22681 /* PL0 = +5.501032995458057064843e-17 */ + .quad 0x3FEC950A3340C8BF /* PH0 = +8.931933404003514764824e-01 */ + .quad 0x3FC9E1DFFD385423 /* P1 = +2.022056566644617586005e-01 */ + .quad 0xBFC71E2FF88EBA23 /* P2 = -1.806087459239772032583e-01 */ + .quad 0x3FB80AEBD07AB5BA /* P3 = +9.391664352252506838449e-02 */ + .quad 0xBF98404E27EAE6ED /* P4 = -2.368280523908243895884e-02 */ + .quad 0xBF772DA520B5006E /* P5 = -5.658764868087568802107e-03 */ + .quad 0x3F824C9268AF9423 /* P6 = +8.935111827620250551925e-03 */ + .quad 0xBF722AE76D206AE3 /* P7 = -4.435447701349490160113e-03 */ + .quad 0x3F4B807F56298D5E /* P8 = +8.392926941493230644497e-04 */ + .quad 0x3F3D71027DF95D2A /* P9 = 
+4.492407879061627603159e-04 */ + .quad 0xBF3EBD17676755FB /* P10 = -4.690343988874298905483e-04 */ + .quad 0xBFF7000000000000 /* B = -1.4375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C95393C63CE8224 /* PL0 = +7.363407705201031038415e-17 */ + .quad 0x3FED4E6F464286B0 /* PH0 = +9.158245441687622445670e-01 */ + .quad 0x3FC4A45842B7DE1E /* P1 = +1.612654042980787191461e-01 */ + .quad 0xBFC2E7885AFDD3D0 /* P2 = -1.476908153814791087327e-01 */ + .quad 0x3FB4DD6DD51D3FEB /* P3 = +8.150373890862254580204e-02 */ + .quad 0xBF9A05D3ADAB489C /* P4 = -2.541285274021075503042e-02 */ + .quad 0xBF3459B643B4995C /* P5 = -3.105230313899165257622e-04 */ + .quad 0x3F766B30745F2E3A /* P6 = +5.473317409222350365811e-03 */ + .quad 0xBF6C2C891E555BDF /* P7 = -3.439204988051155730940e-03 */ + .quad 0x3F5194F30D6C576D /* P8 = +1.073109966176012791522e-03 */ + .quad 0x3EF4DBB43C3132A2 /* P9 = +1.989194766975849961365e-05 */ + .quad 0xBF2E45EBAB3C15A0 /* P10 = -2.309656316514087783666e-04 */ + .quad 0xBFF9000000000000 /* B = -1.5625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C75111669651DAA /* PL0 = +1.827249135453834384396e-17 */ + .quad 0x3FEDE1EB5937518F /* PH0 = +9.338280432225917193634e-01 */ + .quad 0x3FC06129C7C8EBB1 /* P1 = +1.279651856910653382507e-01 */ + .quad 0xBFBE9763041064E1 /* P2 = -1.194974789545031421774e-01 */ + .quad 0x3FB1A5B9F9113928 /* P3 = +6.893503504509068635308e-02 */ + .quad 0xBF992145039F9AFE /* P4 = -2.454097590080105816526e-02 */ + .quad 0x3F66CB116EA49C89 /* P5 = +2.782377288116648315142e-03 */ + .quad 0x3F67F972FDF30001 /* P6 = +2.926563829163342740100e-03 */ + .quad 0xBF63A7B5975F02F3 /* P7 = -2.399305983061922438601e-03 */ + .quad 0x3F4FDE7B8777F4C8 /* P8 = +9.725669069095216373599e-04 */ + .quad 0xBF25918876626BA4 /* P9 = 
-1.645545082212515656240e-04 */ + .quad 0xBF1495123C991F00 /* P10 = -7.851527984669912693674e-05 */ + .quad 0xBFFB000000000000 /* B = -1.6875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C9F29A5B7426D27 /* PL0 = +1.081172820484012446345e-16 */ + .quad 0x3FEE56B6F3EFABFC /* PH0 = +9.480852856044061915952e-01 */ + .quad 0x3FB9E3EFD94BB9FC /* P1 = +1.011342912204113371518e-01 */ + .quad 0xBFB88BD9760FECA7 /* P2 = -9.588393337610288420285e-02 */ + .quad 0x3FAD48A0350B3ACF /* P3 = +5.719471595295077387313e-02 */ + .quad 0xBF96CC6A5110F129 /* P4 = -2.226415748394675367257e-02 */ + .quad 0x3F71934687170384 /* P5 = +4.290843485649345772606e-03 */ + .quad 0x3F5407BAF73B3DF9 /* P6 = +1.222546180475235334287e-03 */ + .quad 0xBF591B626C0646DD /* P7 = -1.532407870488964407324e-03 */ + .quad 0x3F48B0E1DD283558 /* P8 = +7.535078860329375669277e-04 */ + .quad 0xBF2B322292840D2B /* P9 = -2.074877932117605962646e-04 */ + .quad 0xBE99E4061120C741 /* P10 = -3.858017559892704559672e-07 */ + .quad 0xBFFD000000000000 /* B = -1.8125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6AF8C2041C67CD /* PL0 = +1.169711482626385762338e-17 */ + .quad 0x3FEEB2DFEDD5EC93 /* PH0 = +9.593352933146824801369e-01 */ + .quad 0x3FB465A205CFB638 /* P1 = +7.967579500083210999681e-02 */ + .quad 0xBFB3914BF68D39FF /* P2 = -7.643580216720378576778e-02 */ + .quad 0x3FA7F21A08C5C734 /* P3 = +4.676896435820623621673e-02 */ + .quad 0xBF93DA9560EA9960 /* P4 = -1.938851741820124550772e-02 */ + .quad 0x3F73953FEC62820E /* P5 = +4.781007481284861359820e-03 */ + .quad 0x3F2749D5E1273E3C /* P6 = +1.776765426044646108071e-04 */ + .quad 0xBF4D46B0B498CE5A /* P7 = -8.934367007839658352859e-04 */ + .quad 0x3F4153D680E1F4C4 /* P8 = +5.287930851093571206574e-04 */ + .quad 0xBF28477014ECA6A2 /* P9 = 
-1.852344816708944640949e-04 */ + .quad 0x3EFFAC54E07CEB4B /* P10 = +3.020588886147182143902e-05 */ + .quad 0xBFFF000000000000 /* B = -1.9375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7A8AF2BB2231F2 /* PL0 = +2.302217989249372577466e-17 */ + .quad 0x3FEF1994DF724FC8 /* PH0 = +9.718727459135090285258e-01 */ + .quad 0x3FAC65B1BC0C9D58 /* P1 = +5.546336575053583942603e-02 */ + .quad 0xBFAB9937BDA747C8 /* P2 = -5.390333356957871365599e-02 */ + .quad 0x3FA15B42D9EF931C /* P3 = +3.389939222669210777241e-02 */ + .quad 0xBF8EACD8E8507A3C /* P4 = -1.497811755149058215502e-02 */ + .quad 0x3F7263A15721C682 /* P5 = +4.489546046998806349050e-03 */ + .quad 0xBF42A032ACDC3B32 /* P6 = -5.684134900735048121829e-04 */ + .quad 0xBF3431E79B5AD185 /* P7 = -3.081503340170088810438e-04 */ + .quad 0x3F31B51667C7DF5E /* P8 = +2.701930714290502424828e-04 */ + .quad 0xBF1F8709579250AD /* P9 = -1.202678157759563704341e-04 */ + .quad 0x3F01ED8ED1BF9595 /* P10 = +3.419487094883790833778e-05 */ + .quad 0xC001000000000000 /* B = -2.125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C86F3F7C3DAFC55 /* PL0 = +3.981710680748877459333e-17 */ + .quad 0x3FEF73776B2AA2DB /* PH0 = +9.828450291725759901951e-01 */ + .quad 0x3FA16A7FC4D7B900 /* P1 = +3.401564863075812007064e-02 */ + .quad 0xBFA11E03803AD621 /* P2 = -3.343211117082156940532e-02 */ + .quad 0x3F9609591597297F /* P3 = +2.152003473546803654658e-02 */ + .quad 0xBF847E74ED9BBB0C /* P4 = -1.000682211039596246436e-02 */ + .quad 0x3F6BFF771725CD65 /* P5 = +3.417713736035987187864e-03 */ + .quad 0xBF491D1FF73C18FA /* P6 = -7.664114077392807421000e-04 */ + .quad 0x3EF53EE467B51DC5 /* P7 = +2.026145237479599375099e-05 */ + .quad 0x3F160135BE0D94A0 /* P8 = +8.394136922403255700685e-05 */ + .quad 0xBF0B32CB1D276A40 /* P9 = 
-5.187685350778849443841e-05 */ + .quad 0x3EF4DAF70C12D555 /* P10 = +1.988919462255396826584e-05 */ + .quad 0xC003000000000000 /* B = -2.375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C19DBF4E2E5B7DC /* PL0 = +3.504575836708380670219e-19 */ + .quad 0x3FEFAA7934B75EBD /* PH0 = +9.895597486128832054320e-01 */ + .quad 0x3F9545200830A42C /* P1 = +2.077150392520736492125e-02 */ + .quad 0xBF950C46D285F6BC /* P2 = -2.055464420253970271376e-02 */ + .quad 0x3F8B79F5BFC6513F /* P3 = +1.341621390819425058164e-02 */ + .quad 0xBF7A50ADAD777898 /* P4 = -6.424597194806612772505e-03 */ + .quad 0x3F633A19BE8255E3 /* P5 = +2.347040444940816227383e-03 */ + .quad 0xBF44E609BC2557B7 /* P6 = -6.377742322836087134324e-04 */ + .quad 0x3F1AFCBAD60EAACD /* P7 = +1.029480968230231421206e-04 */ + .quad 0x3EE80476AC34A8EF /* P8 = +1.145240583485084317660e-05 */ + .quad 0xBEF278E23DE463E9 /* P9 = -1.761646478213091821804e-05 */ + .quad 0x3EE209FAF377264D /* P10 = +8.601658563106529694651e-06 */ + .quad 0xC005000000000000 /* B = -2.625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C979D62702C631C /* PL0 = +8.193023793215066385979e-17 */ + .quad 0x3FEFCC04CDBCDC4B /* PH0 = +9.936546343150295390600e-01 */ + .quad 0x3F89E87D088D269A /* P1 = +1.265046770426474576547e-02 */ + .quad 0xBF89BE6721012B80 /* P2 = -1.257019586059526836624e-02 */ + .quad 0x3F80F1C13E8D39D3 /* P3 = +8.273610803056031004326e-03 */ + .quad 0xBF7082DBC9602757 /* P4 = -4.031046430108839563004e-03 */ + .quad 0x3F590BE9BD4E0A11 /* P5 = +1.528719197467002507978e-03 */ + .quad 0xBF3DCC2BEF6D0283 /* P6 = -4.546744598208711809986e-04 */ + .quad 0x3F1A08065C4A8E85 /* P7 = +9.930170842636406837764e-05 */ + .quad 0xBEE528117D0410F3 /* P8 = -1.008821337267942266431e-05 */ + .quad 0xBED0BE73A44FF565 /* P9 = 
-3.992069257383521775961e-06 */ + .quad 0x3EC9B0C11E342E38 /* P10 = +3.062539904901699218737e-06 */ + .quad 0xC007000000000000 /* B = -2.875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C804B931AD7A3CC /* PL0 = +2.826768921701616830245e-17 */ + .quad 0x3FEFE06EB0688212 /* PH0 = +9.961465306733450209009e-01 */ + .quad 0x3F7F81BD8876224D /* P1 = +7.692089427458426472642e-03 */ + .quad 0xBF7F62A8C699A963 /* P2 = -7.662448196791823756776e-03 */ + .quad 0x3F74C31E2B2A6A28 /* P3 = +5.068891378551522166321e-03 */ + .quad 0xBF6470D537F16227 /* P4 = -2.495209162173734080001e-03 */ + .quad 0x3F4FAEEF61C89673 /* P5 = +9.668988091717359455754e-04 */ + .quad 0xBF33C5E80B349783 /* P6 = -3.017131341088651514023e-04 */ + .quad 0x3F138F3D31037A6B /* P7 = +7.461367590931028650557e-05 */ + .quad 0xBEEB3C780996FFE3 /* P8 = -1.298723536791163711556e-05 */ + .quad 0x3E9D0C75BC8BFEFC /* P9 = +4.328589367358221917138e-07 */ + .quad 0x3EAC3865227764D4 /* P10 = +8.410302755848104487452e-07 */ + .quad 0xC009000000000000 /* B = -3.125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C5B978B202749F9 /* PL0 = +5.983054034451594408315e-18 */ + .quad 0x3FEFECD6B7EA3128 /* PH0 = +9.976609794698889643882e-01 */ + .quad 0x3F73238B786137FE /* P1 = +4.672570043181776968058e-03 */ + .quad 0xBF731815ACEA072E /* P2 = -4.661640805922390930706e-03 */ + .quad 0x3F6956F0816D5AEE /* P3 = +3.093213784647877798933e-03 */ + .quad 0xBF591A16286C4885 /* P4 = -1.532098425461232453877e-03 */ + .quad 0x3F43B3E3A00C6096 /* P5 = +6.012784434430592468442e-04 */ + .quad 0xBF29441B2A56DEC7 /* P6 = -1.927645836710038499293e-04 */ + .quad 0x3F0A99C3A2E857B6 /* P7 = +5.073669705184196724674e-05 */ + .quad 0xBEE61CB034DDC151 /* P8 = -1.054385361573597042258e-05 */ + .quad 0x3EB792BBC76D6107 /* P9 = 
+1.405070887824641788698e-06 */ + .quad 0x3E761472362A16F0 /* P10 = +8.225391704739515383837e-08 */ + .quad 0xC00B000000000000 /* B = -3.375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C9C290AFCBDE00D /* PL0 = +9.770074992945060684926e-17 */ + .quad 0x3FEFF45F6D36133A /* PH0 = +9.985806592017987259879e-01 */ + .quad 0x3F673CEC093032DE /* P1 = +2.836667068100913999228e-03 */ + .quad 0xBF67347A7CD844D5 /* P2 = -2.832640870800243808078e-03 */ + .quad 0x3F5EDA25530355DB /* P3 = +1.883064698679040793627e-03 */ + .quad 0xBF4EAD3BBABC1BA9 /* P4 = -9.361783645268534848806e-04 */ + .quad 0x3F3842E61CD35432 /* P5 = +3.701984213198588740338e-04 */ + .quad 0xBF1F9AB7FD1A3DDD /* P6 = -1.205611036090218544867e-04 */ + .quad 0x3F0136C154EA3DED /* P7 = +3.283288480304320224929e-05 */ + .quad 0xBEDF12807F721E66 /* P8 = -7.408207230892235753013e-06 */ + .quad 0x3EB5B53687AD5112 /* P9 = +1.293889481520047941659e-06 */ + .quad 0xBE801E90FBFED147 /* P10 = -1.200988872775447204019e-07 */ + .quad 0xC00D000000000000 /* B = -3.625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C9E323294294877 /* PL0 = +1.047637125334028950603e-16 */ + .quad 0x3FEFF8F21CDAAA62 /* PH0 = +9.991388858373506653976e-01 */ + .quad 0x3F5C3470628813F2 /* P1 = +1.721486807697344658108e-03 */ + .quad 0xBF5C2E38AC6FF8D2 /* P2 = -1.720004411026422324849e-03 */ + .quad 0x3F52C13234626F43 /* P3 = +1.144694354969070234454e-03 */ + .quad 0xBF42B0A47DF47BB4 /* P4 = -5.703738387728891173354e-04 */ + .quad 0x3F2DB2889E32FBFD /* P5 = +2.265731592156760387344e-04 */ + .quad 0xBF1385FBD54C5A55 /* P6 = -7.447576110695385196414e-05 */ + .quad 0x3EF5AFA812C6984E /* P7 = +2.068153223579892541184e-05 */ + .quad 0xBED47097C188A03C /* P8 = -4.873231795467276043290e-06 */ + .quad 0x3EAFF2B982F7EE8C /* P9 = 
+9.521288628073486288914e-07 */ + .quad 0xBE828EC5B57D424D /* P10 = -1.382656715739529384702e-07 */ + .quad 0xC00F000000000000 /* B = -3.875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C9BA40DA6983BEC /* PL0 = +9.589840482158163453169e-17 */ + .quad 0x3FEFFCAAC3F20E65 /* PH0 = +9.995931460438894911036e-01 */ + .quad 0x3F4AA87CF664754C /* P1 = +8.135423820793490331956e-04 */ + .quad 0xBF4AA5B62919E224 /* P2 = -8.132113891426467676310e-04 */ + .quad 0x3F41C01B53B0B312 /* P3 = +5.416997368051531710388e-04 */ + .quad 0xBF31B8B54D091751 /* P4 = -2.704088811110632606347e-04 */ + .quad 0x3F1C431305954ECC /* P5 = +1.078110084525254933728e-04 */ + .quad 0xBF02B7DEAD0D44E6 /* P6 = -3.570221236393906131126e-05 */ + .quad 0x3EE51C6EFF109EA9 /* P7 = +1.006654199116272154479e-05 */ + .quad 0xBEC48CFB08072D17 /* P8 = -2.449834994621594976610e-06 */ + .quad 0x3EA1585EC59CAE34 /* P9 = +5.169271261920604503617e-07 */ + .quad 0xBE78832BAF950BA9 /* P10 = -9.131575131209528255629e-08 */ + .quad 0xC011000000000000 /* B = -4.25 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8FBF237F4AFE10 /* PL0 = +5.507163370275307643966e-17 */ + .quad 0x3FEFFEC61279A3A4 /* PH0 = +9.998503075449787225182e-01 */ + .quad 0x3F339E78281A00EA /* P1 = +2.993625022114214863645e-04 */ + .quad 0xBF339DB7B072AD62 /* P2 = -2.993176899035080028902e-04 */ + .quad 0x3F2A259E658EF4E4 /* P3 = +1.994853835451177669594e-04 */ + .quad 0xBF1A219C312B10BA /* P4 = -9.968295880030927192162e-05 */ + .quad 0x3F04E146B4F5F4B7 /* P5 = +3.982541113154699160876e-05 */ + .quad 0xBEEBC5F137088210 /* P6 = -1.324329943580649487333e-05 */ + .quad 0x3ECF96736E300B00 /* P7 = +3.765547135882256916132e-06 */ + .quad 0xBEAF4874840B91EB /* P8 = -9.323068824421825762292e-07 */ + .quad 0x3E8B6AB2B5C8FD3F /* P9 = 
+2.042709991312793245971e-07 */ + .quad 0xBE650BCCE62FD2B7 /* P10 = -3.920140725219944650830e-08 */ + .quad 0xC013000000000000 /* B = -4.75 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C9C869C85471703 /* PL0 = +9.896883942603146946483e-17 */ + .quad 0x3FEFFF8C81C6DC33 /* PH0 = +9.999449286177707341139e-01 */ + .quad 0x3F1CDF5A2E4D7C69 /* P1 = +1.101397316012206760643e-04 */ + .quad 0xBF1CDEF1F9BE63BE /* P2 = -1.101336660539594564027e-04 */ + .quad 0x3F133EC10C83AAA0 /* P3 = +7.341435696487731017506e-05 */ + .quad 0xBF033DAB325FAACB /* P4 = -3.669909192168459445238e-05 */ + .quad 0x3EEEC598FA98BAD8 /* P5 = +1.467316890843338172161e-05 */ + .quad 0xBED47F1A15BA368E /* P6 = -4.886744445221253126882e-06 */ + .quad 0x3EB761FBE7D201C1 /* P7 = +1.393720509029845064726e-06 */ + .quad 0xBE974CD75A43BF6B /* P8 = -3.471994551992448536007e-07 */ + .quad 0x3E74B02965BBF8DC /* P9 = +7.706929621914905669946e-08 */ + .quad 0xBE504EF4E3892A66 /* P10 = -1.518840362012570189110e-08 */ + .quad 0xC015000000000000 /* B = -5.25 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C643810400471B0 /* PL0 = +8.768592603904887599187e-18 */ + .quad 0x3FEFFFD583014825 /* PH0 = +9.999797400180382433987e-01 */ + .quad 0x3F053E71416C43CA /* P1 = +4.051955345663706869871e-05 */ + .quad 0xBF053E550C7C8CC9 /* P2 = -4.051873253121394012080e-05 */ + .quad 0x3EFC52D0D90D4843 /* P3 = +2.701139380018752534477e-05 */ + .quad 0xBEEC523A6ADBE142 /* P4 = -1.350460237457883558350e-05 */ + .quad 0x3ED6A73E22D844B3 /* P5 = +5.400965660055565196396e-06 */ + .quad 0xBEBE31D10F23ACD0 /* P6 = -1.799738182979224868919e-06 */ + .quad 0x3EA13E14264DEAB2 /* P7 = +5.138663935333241981438e-07 */ + .quad 0xBE81385ABB98EDCC /* P8 = -1.282999997786486835638e-07 */ + .quad 0x3E5EB9164593E0B6 /* P9 = 
+2.861301981891537161158e-08 */ + .quad 0xBE387218CFE7772E /* P10 = -5.691705994073124478195e-09 */ + .quad 0xC017000000000000 /* B = -5.75 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C92530433F4C703 /* PL0 = +6.357512739163799046861e-17 */ + .quad 0x3FEFFFF05E8D3191 /* PH0 = +9.999925467214315633058e-01 */ + .quad 0x3EEF42DDFA52B575 /* P1 = +1.490650158538873335176e-05 */ + .quad 0xBEEF42CEB54212AA /* P2 = -1.490639048307961378200e-05 */ + .quad 0x3EE4D7201CBCB853 /* P3 = +9.937445518550804010127e-06 */ + .quad 0xBED4D6F764B66C37 /* P4 = -4.968574624976280456686e-06 */ + .quad 0x3EC0ABB806EBDE71 /* P5 = +1.987311456171617620608e-06 */ + .quad 0xBEA6399CF854F876 /* P6 = -6.623581475862682369330e-07 */ + .quad 0x3E8964B91728D7C9 /* P7 = +1.891959403186505598965e-07 */ + .quad 0xBE6961A0528444D6 /* P8 = -4.727645325404986954168e-08 */ + .quad 0x3E46AE3B0814EE00 /* P9 = +1.056147192151514779549e-08 */ + .quad 0xBE221B8194DACD16 /* P10 = -2.107984154277957626641e-09 */ + .quad 0xC019000000000000 /* B = -6.25 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7BB5622CE1A79E /* PL0 = +2.403331811901679167526e-17 */ + .quad 0x3FEFFFFA3FF22708 /* PH0 = +9.999972580855862602789e-01 */ + .quad 0x3ED7003552D53503 /* P1 = +5.483821309338170039906e-06 */ + .quad 0xBED7003130C1AB92 /* P2 = -5.483806273169366545037e-06 */ + .quad 0x3ECEAAE13B699C45 /* P3 = +3.655850800133043324271e-06 */ + .quad 0xBEBEAACB305F3D07 /* P4 = -1.827905351959291114416e-06 */ + .quad 0x3EA8887F5F9C87EF /* P5 = +7.311461438267648556646e-07 */ + .quad 0xBE905AD08DF8454F /* P6 = -2.437046884027860662692e-07 */ + .quad 0x3E72B068300B703F /* P7 = +6.962228483613086736676e-08 */ + .quad 0xBE52AF921A71C058 /* P8 = -1.740252888706390465423e-08 */ + .quad 0x3E30B53EAA35300D /* P9 = 
+3.890131469838137725119e-09 */ + .quad 0xBE0AB60CDAD7E22E /* P10 = -7.773963050435300060566e-10 */ + .quad 0xC01B000000000000 /* B = -6.75 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8BD1ACF80D7256 /* PL0 = +4.825835138930451121169e-17 */ + .quad 0x3FEFFFFDE2760A41 /* PH0 = +9.999989913051835488389e-01 */ + .quad 0x3EC0EC4F1EC27E55 /* P1 = +2.017388615341105998718e-06 */ + .quad 0xBEC0EC4E005E6EAC /* P2 = -2.017386580411626200507e-06 */ + .quad 0x3EB6906504BC4610 /* P3 = +1.344921673533307001969e-06 */ + .quad 0xBEA6905F0D52C8B5 /* P4 = -6.724581235377781360384e-07 */ + .quad 0x3E920D0F5CCE152B /* P5 = +2.689810941136721216499e-07 */ + .quad 0xBE7811505B10E753 /* P6 = -8.965891741619763761543e-08 */ + .quad 0x3E5B811EE4F9B8EE /* P7 = +2.561544781706659619288e-08 */ + .quad 0xBE3B80ABC067E840 /* P8 = -6.403452884688571158579e-09 */ + .quad 0x3E1898E394E09335 /* P9 = +1.431746793613569087489e-09 */ + .quad 0xBDF3ABB5BA711DB7 /* P10 = -2.862469657501951918569e-10 */ + .quad 0xC01D000000000000 /* B = -7.25 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8AE01DB39A3791 /* PL0 = +4.662147961093911873193e-17 */ + .quad 0x3FEFFFFF38C76668 /* PH0 = +9.999996289217962797125e-01 */ + .quad 0x3EA8E712E56E1188 /* P1 = +7.421562696484951529573e-07 */ + .quad 0xBEA8E7124A650791 /* P2 = -7.421559942504648535596e-07 */ + .quad 0x3EA09A0B62D8EF94 /* P3 = +4.947702955735978541097e-07 */ + .quad 0xBE909A09C56C2107 /* P4 = -2.473847805916120382218e-07 */ + .quad 0x3E7A900A90A54A6E /* P5 = +9.895362410487317236618e-08 */ + .quad 0xBE61B5557BB449B6 /* P6 = -3.298434544432568302770e-08 */ + .quad 0x3E443CC74732CDCA /* P7 = +9.423781066565733462466e-09 */ + .quad 0xBE243CA8AA8D6E54 /* P8 = -2.355890888986360997159e-09 */ + .quad 0x3E0219C341E0D1B4 /* P9 = 
+5.267978308406275552691e-10 */ + .quad 0xBDDCF49A10950F13 /* P10 = -1.053394074620716018815e-10 */ + .quad 0xC01F000000000000 /* B = -7.75 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C75CB18F3775414 /* PL0 = +1.890271747518592444083e-17 */ + .quad 0x3FEFFFFFD38C39F0 /* PH0 = +9.999999172012490333827e-01 */ + .quad 0x3E8639E2F89493BB /* P1 = +1.655974950855472979393e-07 */ + .quad 0xBE8639E2D9B29562 /* P2 = -1.655974813708346974914e-07 */ + .quad 0x3E7DA2836A1F706E /* P3 = +1.103982989742589616541e-07 */ + .quad 0xBE6DA282C6733DAE /* P4 = -5.519913131581509871840e-08 */ + .quad 0x3E57B53A278851FD /* P5 = +2.207971980430773309147e-08 */ + .quad 0xBE3F9C4A72536E22 /* P6 = -7.359895614149337484810e-09 */ + .quad 0x3E220E81FBE19CDD /* P7 = +2.102073153607135257714e-09 */ + .quad 0xBE020E8875ADA8D8 /* P8 = -5.255211642212584097407e-10 */ + .quad 0x3DE07634328384FC /* P9 = +1.197748786062966341989e-10 */ + .quad 0xBDBA54078E3C351F /* P10 = -2.394539505021488953905e-11 */ + .quad 0xC021000000000000 /* B = -8.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C98B78738B0EDEF /* PL0 = +8.575399788039081964921e-17 */ + .quad 0x3FEFFFFFF9FBEA40 /* PH0 = +9.999999887944071019774e-01 */ + .quad 0x3E581056FAC28C46 /* P1 = +2.241118550516412682327e-08 */ + .quad 0xBE581056F63A4351 /* P2 = -2.241118525356742542550e-08 */ + .quad 0x3E500AE49533790A /* P3 = +1.494078933911655875521e-08 */ + .quad 0xBE400AE489ACBA90 /* P4 = -7.470394349637968945652e-09 */ + .quad 0x3E29AB0D59A1967B /* P5 = +2.988168557255271725494e-09 */ + .quad 0xBE111CB32D6EEF2B /* P6 = -9.960558400070350772418e-10 */ + .quad 0x3DF38CBADF396908 /* P7 = +2.844859618921805216353e-10 */ + .quad 0xBDD38CC7B92CECD3 /* P8 = -7.112220386749926320915e-11 */ + .quad 0x3DB1D2BBE2705032 /* P9 = 
+1.621008722427575444686e-11 */ + .quad 0xBD8C8199294E6380 /* P10 = -3.240784656869469020111e-12 */ + .quad 0xC023000000000000 /* B = -9.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8EEEC16618B984 /* PL0 = +5.365957423487855307906e-17 */ + .quad 0x3FEFFFFFFF2F9279 /* PH0 = +9.999999984834878619111e-01 */ + .quad 0x3E2A0DB0D052B148 /* P1 = +3.033024167396880687734e-09 */ + .quad 0xBE2A0DB0CFA6AB71 /* P2 = -3.033024162734192808028e-09 */ + .quad 0x3E215E75D53A3105 /* P3 = +2.022016035353114070618e-09 */ + .quad 0xBE115E75D40AA47F /* P4 = -1.011008013562702155050e-09 */ + .quad 0x3DFBCA5CDC12ED1C /* P5 = +4.044047007631481841556e-10 */ + .quad 0xBDE286E85704FC22 /* P6 = -1.348015410318274576187e-10 */ + .quad 0x3DC52A8925354517 /* P7 = +3.850101197145027796396e-11 */ + .quad 0xBDA52A97EA3F5F4A /* P8 = -9.625355478142550638468e-12 */ + .quad 0x3D834C011A2AC0F7 /* P9 = +2.193802608697321032841e-12 */ + .quad 0xBD5EDD05BDCB3A62 /* P10 = -4.385948508419928563300e-13 */ + .quad 0xC025000000000000 /* B = -10.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6BD8B474BBF792 /* PL0 = +1.207649585364892639612e-17 */ + .quad 0x3FEFFFFFFFE3CAD8 /* PH0 = +9.999999997947623953110e-01 */ + .quad 0x3DFC3527E43C565F /* P1 = +4.104751852963940338559e-10 */ + .quad 0xBDFC3527E420F415 /* P2 = -4.104751852036136216697e-10 */ + .quad 0x3DF2CE1A8D806DAD /* P3 = +2.736501142887952919489e-10 */ + .quad 0xBDE2CE1A8DDF690A /* P4 = -1.368250573053032426141e-10 */ + .quad 0x3DCE169832D8BD68 /* P5 = +5.473022586854025789680e-11 */ + .quad 0xBDB40F0FE853DA5B /* P6 = -1.824340550195944358477e-11 */ + .quad 0x3D96EA8D930D31A1 /* P7 = +5.210545794901128943676e-12 */ + .quad 0xBD76EA9DB0D09839 /* P8 = -1.302650427355019556441e-12 */ + .quad 0x3D54E474FD4303A1 /* P9 = 
+2.968990047962355000258e-13 */ + .quad 0xBD30B526CA2B228A /* P10 = -5.935740124899435401321e-14 */ + .quad 0xC027000000000000 /* B = -11.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C56E8953D525FD5 /* PL0 = +4.967494994909661698725e-18 */ + .quad 0x3FEFFFFFFFFC2EB9 /* PH0 = +9.999999999722241073030e-01 */ + .quad 0x3DCE8A37A48016C2 /* P1 = +5.555177547354687971427e-11 */ + .quad 0xBDCE8A37A479B7D4 /* P2 = -5.555177547084873157964e-11 */ + .quad 0x3DC45C250CFA9C16 /* P3 = +3.703451575129414499553e-11 */ + .quad 0xBDB45C250D9F8467 /* P4 = -1.851725791056759260154e-11 */ + .quad 0x3DA049BB33CBD4E9 /* P5 = +7.406930640558963265190e-12 */ + .quad 0xBD85B7A407C422C1 /* P6 = -2.468976464832073512208e-12 */ + .quad 0x3D68CF9CED2B3FD5 /* P7 = +7.051706989348171774536e-13 */ + .quad 0xBD48CFAE64C352B3 /* P8 = -1.762945685274427023683e-13 */ + .quad 0x3D269EAE08690D52 /* P9 = +4.018091287355461204663e-14 */ + .quad 0xBD0216CBEAFFF5AA /* P10 = -8.033151495672990022322e-15 */ + .quad 0xC029000000000000 /* B = -12.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8ACF1392B106D3 /* PL0 = +4.650601502940921454330e-17 */ + .quad 0x3FEFFFFFFFFF7BBD /* PH0 = +9.999999999962408958609e-01 */ + .quad 0x3DA088529889B316 /* P1 = +7.518115268189742464885e-12 */ + .quad 0xBDA088529887F4C4 /* P2 = -7.518115268005149164680e-12 */ + .quad 0x3D960B18BF1DF711 /* P3 = +5.012076679213679703380e-12 */ + .quad 0xBD860B18BFD99A48 /* P4 = -2.506038344573564868987e-12 */ + .quad 0x3D71A27E7CA64143 /* P5 = +1.002419056539285288454e-12 */ + .quad 0xBD5783530EA76D91 /* P6 = -3.341396294294381580191e-13 */ + .quad 0x3D3ADCC75CBD2A03 /* P7 = +9.543447641637910477850e-14 */ + .quad 0xBD1ADCDA46BE5F17 /* P8 = -2.385887543769010971872e-14 */ + .quad 0x3CF87D77650BE5B8 /* P9 = 
+5.437895260471143131391e-15 */ + .quad 0xBCD395AE6E74C6D2 /* P10 = -1.087168847335561258239e-15 */ + .quad 0xC02B000000000000 /* B = -13.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C97A8A295292858 /* PL0 = +8.208271151146829171896e-17 */ + .quad 0x3FEFFFFFFFFFEE19 /* PH0 = +9.999999999994911847878e-01 */ + .quad 0x3D71E642BB008F95 /* P1 = +1.017466259229268282255e-12 */ + .quad 0xBD71E642BAFEEC54 /* P2 = -1.017466259207593392022e-12 */ + .quad 0x3D67DDAE41647741 /* P3 = +6.783108169938233581038e-13 */ + .quad 0xBD57DDAE4230F34B /* P4 = -3.391554091734942426856e-13 */ + .quad 0x3D4317C33FAE2536 /* P5 = +1.356626669455791324801e-13 */ + .quad 0xBD2975040D3E26B9 /* P6 = -4.522088139411435138867e-14 */ + .quad 0x3D0D155DCD0F0AFB /* P7 = +1.291565189902030307333e-14 */ + .quad 0xBCED157247832B20 /* P8 = -3.228947666403019234175e-15 */ + .quad 0x3CCA83D70F607C28 /* P9 = +7.359390959466796619024e-16 */ + .quad 0xBCA5343952C1E19E /* P10 = -1.471323041436694087188e-16 */ + .quad 0xC02D000000000000 /* B = -14.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C9B7876CBC5306E /* PL0 = +9.530765996816607711732e-17 */ + .quad 0x3FEFFFFFFFFFFD93 /* PH0 = +9.999999999999310551502e-01 */ + .quad 0x3D436121E2640D76 /* P1 = +1.376990843765503869546e-13 */ + .quad 0xBD436121E26250EA /* P2 = -1.376990843736775811281e-13 */ + .quad 0x3D39D6D7CA259186 /* P3 = +9.179938654047876451320e-14 */ + .quad 0xBD29D6D7CB0327CE /* P4 = -4.589969336188563660531e-14 */ + .quad 0x3D14ABE4DC31244A /* P5 = +1.835994545584345768382e-14 */ + .quad 0xBCFB8FDB82AB6BB7 /* P6 = -6.119980791767901275443e-15 */ + .quad 0x3CDF7CF757491B60 /* P7 = +1.747943407988343076526e-15 */ + .quad 0xBCBF7D0D833640FB /* P8 = -4.369905470133249448357e-16 */ + .quad 0x3C9CB512F6BDC754 /* P9 = 
+9.959852600692493655511e-17 */ + .quad 0xBC76F50AB1B0E9BA /* P10 = -1.991219205936492089091e-17 */ + .quad 0xC02F000000000000 /* B = -15.5 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6FFE15D5F78543 /* PL0 = +1.387454417328248962819e-17 */ + .quad 0x3FEFFFFFFFFFFFE1 /* PH0 = +9.999999999999965583086e-01 */ + .quad 0x3CFEE00288B99C26 /* P1 = +6.855635762864742358597e-15 */ + .quad 0xBCFEE0027D060EE2 /* P2 = -6.855635607998342735403e-15 */ + .quad 0x3CF4954AA23148A2 /* P3 = +4.570381865813341696777e-15 */ + .quad 0xBCE4954B5DAD3010 /* P4 = -2.285192173571711474199e-15 */ + .quad 0x3CD07883DD8793BD /* P5 = +9.143109661358222028007e-16 */ + .quad 0xBCB5F5F4BB87ADCF /* P6 = -3.047668447080103869032e-16 */ + .quad 0x3C98F1A905097685 /* P7 = +8.654183371862458774513e-17 */ + .quad 0xBC78F2D585007222 /* P8 = -2.163943551222030413627e-17 */ + .quad 0x3C58A37CC5082B5F /* P9 = +5.342649626494471588064e-18 */ + .quad 0xBC33AE7917F94D17 /* P10 = -1.066938163384541013918e-18 */ + .quad 0xC031000000000000 /* B = -17 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C91BF1D80474F0F /* PL0 = +6.157069264461989135096e-17 */ + .quad 0x3FEFFFFFFFFFFFFE /* PH0 = +9.999999999999997779554e-01 */ + .quad 0x3CB72071400E6275 /* P1 = +3.209478247225075961360e-16 */ + .quad 0xBCB72071400A9F37 /* P2 = -3.209478247103497434502e-16 */ + .quad 0x3CAED5EC39A77629 /* P3 = +2.139652050028423711308e-16 */ + .quad 0xBC9ED5EC3B530600 /* P4 = -1.069826028468029104719e-16 */ + .quad 0x3C88AB2BFED159DE /* P5 = +4.279326904335078988705e-17 */ + .quad 0xBC70721D1220B3FC /* P6 = -1.426441958074916244382e-17 */ + .quad 0x3C52C96049721FB8 /* P7 = +4.073700029965821523731e-18 */ + .quad 0xBC32C971215735DC /* P8 = -1.018438939975201710113e-18 */ + .quad 0x3C112EF658AB41A9 /* P9 = 
+2.328791246104218830028e-19 */
+	.quad	0xBBEB7B598C6AD3DE	/* P10 = -4.655603964908654142787e-20 */
+	.quad	0xC03287E0C98F84E5	/* B = -18.530774 */
+	.quad	0x3FF0000000000000	/* A = +1 */
+	.quad	0x0000000000000000	/* Align value = +0 */
+	.quad	0x0000000000000000	/* Align value = +0 */
+	.quad	0x0000000000000000	/* PL0 = +0.000000000000000000000e-01 */
+	.quad	0x3FF0000000000000	/* PH0 = +1.000000000000000000000e+00 */
+	.quad	0x0000000000000000	/* P1 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P2 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P3 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P4 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P5 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P6 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P7 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P8 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P9 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* P10 = +0.000000000000000000000e-01 */
+	.quad	0x0000000000000000	/* B = +0 */
+	.quad	0x0000000000000000	/* A = +0 */
+	.quad	0x0000000000000000	/* Align value = +0 */
+	.quad	0x0000000000000000	/* Align value = +0 */
+	.align	16
+	.quad	0x8000000000000000, 0x8000000000000000	/* _dbSignMask */
+	.align	16
+	.quad	0x7fffffffffffffff, 0x7fffffffffffffff	/* _dbAbsMask */
+	.align	16
+	.long	0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000	/* _iExpMantMask */
+	.align	16
+	.long	0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000	/* _iExpMask */
+	.align	16
+	.long	0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000	/* _iMinIdxOfsMask */
+	.align	16
+	.long	0x00760000, 0x00760000, 0x00760000, 0x00760000	/* _iMaxIdxMask */
+	.align	16
+	.type	__svml_dtanh_data_internal,@object
+	.size	__svml_dtanh_data_internal,.-__svml_dtanh_data_internal
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core-sse.S
new file mode 100644
index 0000000000..80e85c47ec
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core-sse.S
@@ -0,0 +1,20 @@
+/* SSE version of vectorized tanh, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define _ZGVdN4v_tanh _ZGVdN4v_tanh_sse_wrapper
+#include "../svml_d_tanh4_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core.c
new file mode 100644
index 0000000000..a26e62052b
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core.c
@@ -0,0 +1,27 @@
+/* Multiple versions of vectorized tanh, vector length is 4.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define SYMBOL_NAME _ZGVdN4v_tanh
+#include "ifunc-mathvec-avx2.h"
+
+libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ());
+
+#ifdef SHARED
+__hidden_ver1 (_ZGVdN4v_tanh, __GI__ZGVdN4v_tanh, __redirect__ZGVdN4v_tanh)
+  __attribute__ ((visibility ("hidden")));
+#endif
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core_avx2.S
new file mode 100644
index 0000000000..53dda241e4
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh4_core_avx2.S
@@ -0,0 +1,1279 @@
+/* Function tanh vectorized with AVX2.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/*
+ * ALGORITHM DESCRIPTION:
+ *
+ *  NOTE: Since the hyperbolic tangent function is odd
+ *        (tanh(x) = -tanh(-x)), below algorithm deals with the absolute
+ *        value of the argument |x|: tanh(x) = sign(x) * tanh(|x|)
+ *
+ *  We use a table lookup method to compute tanh(|x|).
+ *  The basic idea is to split the input range into a number of subintervals
+ *  and to approximate tanh(.) with a polynomial on each of them.
+ *
+ * IEEE SPECIAL CONDITIONS:
+ * x = [+,-]0, r = [+,-]0
+ * x = +Inf, r = +1
+ * x = -Inf, r = -1
+ * x = QNaN, r = QNaN
+ * x = SNaN, r = QNaN
+ *
+ *
+ * ALGORITHM DETAILS
+ * We handle special values in a callout function, aside from main path
+ * computations. "Special" for this algorithm are:
+ * INF, NAN, |x| > HUGE_THRESHOLD
+ *
+ *
+ * Main path computations are organized as follows:
+ * We split the interval [0, SATURATION_THRESHOLD)
+ * into a number of subintervals. On each subinterval we approximate tanh(.)
+ * with a minimax polynomial of pre-defined degree. Polynomial coefficients
+ * are computed beforehand and stored in a table. We also use
+ *
+ * y := |x| + B,
+ *
+ * where B depends on the subinterval and is used to bring the argument
+ * closer to zero.
+ * We also add a large fake interval [SATURATION_THRESHOLD, HUGE_THRESHOLD],
+ * where the coefficients 1.0 + 0.0*y + 0.0*y^2 ... are stored - just to
+ * preserve the main path computation logic while returning 1.0 for all
+ * arguments.
+ *
+ * Hence the reconstruction looks as follows:
+ * we extract the polynomial and range reduction coefficients
+ * (Pj and B) corresponding to the subinterval to which |x| belongs,
+ * and return
+ *
+ * r := sign(x) * (P0 + P1 * y + ... + Pn * y^n)
+ *
+ * NOTE: we use a multiprecision technique to multiply and sum the first
+ * K terms of the polynomial. So Pj, j = 0..K are stored in the
+ * table each as a pair of target precision numbers (Pj and PLj) to
+ * achieve wider than target precision.
+ * + * + */ + +/* Offsets for data table __svml_dtanh_data_internal + */ +#define _dbP 0 +#define _dbSignMask 7680 +#define _dbAbsMask 7712 +#define _iExpMantMask 7744 +#define _iExpMask 7776 +#define _iMinIdxOfsMask 7808 +#define _iMaxIdxMask 7840 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_tanh_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea _dbP+96+__svml_dtanh_data_internal(%rip), %r8 + vmovupd %ymm0, (%rsp) + +/* if VMIN, VMAX is defined for I type */ + vpxor %xmm11, %xmm11, %xmm11 + +/* Constant loading */ + vmovups _iMaxIdxMask+__svml_dtanh_data_internal(%rip), %xmm8 + vandpd _dbAbsMask+__svml_dtanh_data_internal(%rip), %ymm0, %ymm1 + vandpd _dbSignMask+__svml_dtanh_data_internal(%rip), %ymm0, %ymm2 + vextractf128 $1, %ymm0, %xmm15 + vshufps $221, %xmm15, %xmm0, %xmm14 + +/* Here huge arguments, INF and NaNs are filtered out to callout. */ + vpand _iExpMantMask+__svml_dtanh_data_internal(%rip), %xmm14, %xmm12 + vpsubd _iMinIdxOfsMask+__svml_dtanh_data_internal(%rip), %xmm12, %xmm9 + vpcmpgtd %xmm11, %xmm9, %xmm10 + vpcmpgtd %xmm8, %xmm9, %xmm0 + vpand %xmm10, %xmm9, %xmm7 + blendvps %xmm0, %xmm8, %xmm7 + +/* + * VSHRIMM( I, iIndex, = iIndex, (17 - 4) ); + * VGATHER_MATRIX( L2D, p, TAB._dbP, iIndex, 0, T_ITEM_SIZE, T_ITEM_GRAN, 13, 0, 0 ); + */ + vpsrld $10, %xmm7, %xmm6 + vmovd %xmm6, %edx + vpcmpgtd _iExpMask+__svml_dtanh_data_internal(%rip), %xmm12, %xmm13 + vmovmskps %xmm13, %eax + vpextrd $1, %xmm6, %ecx + movslq %edx, %rdx + movslq %ecx, %rcx + vpextrd $2, %xmm6, %esi + vpextrd $3, %xmm6, %edi + movslq %esi, %rsi + movslq %edi, %rdi + vmovupd -96(%rdx,%r8), %xmm3 + vmovupd -96(%rcx,%r8), %xmm4 + vmovupd -80(%rcx,%r8), %xmm13 + vmovupd -64(%rcx,%r8), %xmm9 + vmovupd -80(%rdx,%r8), %xmm14 + vmovupd -64(%rdx,%r8), %xmm10 + vmovupd -48(%rdx,%r8), %xmm6 + vinsertf128 $1, -96(%rsi,%r8), %ymm3, %ymm0 + vinsertf128 $1, -96(%rdi,%r8), 
%ymm4, %ymm15 + vmovupd -48(%rcx,%r8), %xmm3 + vunpckhpd %ymm15, %ymm0, %ymm0 + vinsertf128 $1, -80(%rsi,%r8), %ymm14, %ymm12 + vinsertf128 $1, -64(%rsi,%r8), %ymm10, %ymm8 + vinsertf128 $1, -80(%rdi,%r8), %ymm13, %ymm11 + vinsertf128 $1, -64(%rdi,%r8), %ymm9, %ymm7 + vunpcklpd %ymm11, %ymm12, %ymm15 + vunpckhpd %ymm11, %ymm12, %ymm14 + vunpcklpd %ymm7, %ymm8, %ymm13 + vunpckhpd %ymm7, %ymm8, %ymm12 + vmovupd -32(%rdx,%r8), %xmm9 + vmovupd -32(%rcx,%r8), %xmm8 + vinsertf128 $1, -48(%rsi,%r8), %ymm6, %ymm4 + vinsertf128 $1, -48(%rdi,%r8), %ymm3, %ymm5 + vunpcklpd %ymm5, %ymm4, %ymm11 + vunpckhpd %ymm5, %ymm4, %ymm10 + vmovupd -16(%rdx,%r8), %xmm3 + vmovupd -16(%rcx,%r8), %xmm4 + vinsertf128 $1, -32(%rsi,%r8), %ymm9, %ymm7 + vinsertf128 $1, -32(%rdi,%r8), %ymm8, %ymm6 + vunpcklpd %ymm6, %ymm7, %ymm9 + vunpckhpd %ymm6, %ymm7, %ymm8 + vinsertf128 $1, -16(%rsi,%r8), %ymm3, %ymm5 + vinsertf128 $1, -16(%rdi,%r8), %ymm4, %ymm6 + vunpcklpd %ymm6, %ymm5, %ymm7 + vunpckhpd %ymm6, %ymm5, %ymm6 + vmovupd (%rdx,%r8), %xmm3 + vmovupd (%rcx,%r8), %xmm5 + vinsertf128 $1, (%rsi,%r8), %ymm3, %ymm4 + vinsertf128 $1, (%rdi,%r8), %ymm5, %ymm5 + vunpcklpd %ymm5, %ymm4, %ymm3 + vaddpd %ymm3, %ymm1, %ymm1 + vfmadd213pd %ymm7, %ymm1, %ymm6 + vfmadd213pd %ymm8, %ymm1, %ymm6 + vfmadd213pd %ymm9, %ymm1, %ymm6 + vfmadd213pd %ymm10, %ymm1, %ymm6 + vfmadd213pd %ymm11, %ymm1, %ymm6 + vfmadd213pd %ymm12, %ymm1, %ymm6 + vfmadd213pd %ymm13, %ymm1, %ymm6 + vfmadd213pd %ymm14, %ymm1, %ymm6 + vfmadd213pd %ymm15, %ymm1, %ymm6 + vfmadd213pd %ymm0, %ymm1, %ymm6 + vorpd %ymm2, %ymm6, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd (%rsp), %ymm1 + 
vmovupd %ymm0, 64(%rsp) + vmovupd %ymm1, 32(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 
0xff, 0xff, 0xff, 0x22
 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */
 + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22
 + # LOE rbx r12 r13 r14 r15 ymm0
 +
 +/* Scalar math function call
 + * to process special input
 + */
 +
 +L(SCALAR_MATH_CALL):
 + movl %r12d, %r14d
 + movsd 32(%rsp,%r14,8), %xmm0
 + call tanh@PLT
 + # LOE rbx r14 r15 r12d r13d xmm0
 +
 + movsd %xmm0, 64(%rsp,%r14,8)
 +
 +/* Process special inputs in loop */
 + jmp L(SPECIAL_VALUES_LOOP)
 + # LOE rbx r15 r12d r13d
 +END(_ZGVdN4v_tanh_avx2)
 +
 + .section .rodata, "a"
 + .align 32
 +
 +#ifdef __svml_dtanh_data_internal_typedef
 +typedef unsigned int VUINT32;
 +typedef struct
 +{
 + __declspec(align(32)) VUINT32 _dbP[60*16][2];
 + __declspec(align(32)) VUINT32 _dbSignMask[4][2];
 + __declspec(align(32)) VUINT32 _dbAbsMask[4][2];
 + __declspec(align(32)) VUINT32 _iExpMantMask[8][1];
 + __declspec(align(32)) VUINT32 _iExpMask[8][1];
 + __declspec(align(32)) VUINT32 _iMinIdxOfsMask[8][1];
 + __declspec(align(32)) VUINT32 _iMaxIdxMask[8][1];
 +} __svml_dtanh_data_internal;
 +#endif
 +__svml_dtanh_data_internal:
 + /* Polynomial coefficients */
 + .quad 0x0000000000000000 /* PL0 = +0.000000000000000000000e-01 */
 + .quad 0x0000000000000000 /* PH0 = +0.000000000000000000000e-01 */
 + .quad 0x3FF0000000000000 /* P1 = +1.000000000000000014103e+00 */
 + .quad 0xBD197DEAD79668D3 /* P2 = -2.264132406596103056796e-14 */
 + .quad 0xBFD555555553AF3C /* P3 = -3.333333333273349741024e-01 */
 + .quad 0xBE052F7CCA134846 /* P4 = -6.165791385711493738399e-10 */
 + .quad 0x3FC11111563849D6 /* P5 = +1.333333655353061107201e-01 */
 + .quad 0xBEB038623673FFB2 /* P6 = -9.668021563879858950855e-07 */
 + .quad 0xBFAB9F685E64022E /* P7 = -5.395055916051593179252e-02 */
 + .quad 0xBF2A54E2B28F2207 /* P8 = -2.008940439550829012647e-04 */
 + .quad 0x3F97CFB9328A230E /* P9 = +2.325333949059698582189e-02 */
 + .quad 0xBF75CA6D61723E02
/* P10 = -5.320002811586290441790e-03 */ + .quad 0x0000000000000000 /* B = +0 */ + .quad 0x3FF0000000000000 /* A = +1.0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C3708A564FAD29A /* PL0 = +1.248663375337163807466e-18 */ + .quad 0x3FC0E6973998DA48 /* PH0 = +1.320370703922029154143e-01 */ + .quad 0x3FEF712EB25C0888 /* P1 = +9.825662120422444519229e-01 */ + .quad 0xBFC09B296F7C1EA9 /* P2 = -1.297351641044220078331e-01 */ + .quad 0xBFD3DD77541EDDA7 /* P3 = -3.103922196855485849143e-01 */ + .quad 0x3FB58FFCF4309615 /* P4 = +8.422833406128689275566e-02 */ + .quad 0x3FBD3ABE845DCF49 /* P5 = +1.141776154670967208833e-01 */ + .quad 0xBFA791DF538C37FA /* P6 = -4.603479285115947936529e-02 */ + .quad 0xBFA4F872F69CD6E8 /* P7 = -4.095801601799370195284e-02 */ + .quad 0x3F9772E49EF6412B /* P8 = +2.289921970583567527179e-02 */ + .quad 0x3F8CBC0807393909 /* P9 = +1.403051635784581776625e-02 */ + .quad 0xBF85F06A30F93319 /* P10 = -1.071246110873285040939e-02 */ + .quad 0xBFC1000000000000 /* B = -.132813 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6004EE5739DEAC /* PL0 = +6.947247374112211856530e-18 */ + .quad 0x3FC2DC968E6E0D62 /* PH0 = +1.473568149050193398786e-01 */ + .quad 0x3FEF4E1E606D96DF /* P1 = +9.782859691010478680677e-01 */ + .quad 0xBFC273BD70994AB9 /* P2 = -1.441571044730005866646e-01 */ + .quad 0xBFD382B548270D2C /* P3 = -3.048527912726111386771e-01 */ + .quad 0x3FB7CD2D582A6B29 /* P4 = +9.297450449450351894400e-02 */ + .quad 0x3FBC1278CCCBF0DB /* P5 = +1.096568584434324642303e-01 */ + .quad 0xBFA9C7F5115B86A1 /* P6 = -5.035367810138536095866e-02 */ + .quad 0xBFA371C21BAF618E /* P7 = -3.797728145554222910481e-02 */ + .quad 0x3F9958943F68417E /* P8 = +2.475196492201935923783e-02 */ + .quad 0x3F8930D5CFFD4152 /* P9 = +1.230017701132682667572e-02 */ + .quad 0xBF875CF7ADD31B76 /* P10 = 
-1.140779017658897660092e-02 */ + .quad 0xBFC3000000000000 /* B = -.148438 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7EABE24E052A1F /* PL0 = +2.660321779421749543501e-17 */ + .quad 0x3FC4D04783618C71 /* PH0 = +1.626061812886266111366e-01 */ + .quad 0x3FEF2765AF97A4B3 /* P1 = +9.735592298067302883212e-01 */ + .quad 0xBFC443654205FEA5 /* P2 = -1.583067486171689074207e-01 */ + .quad 0xBFD31F2E208A5B97 /* P3 = -2.987780874040536844467e-01 */ + .quad 0x3FB9F235BD339878 /* P4 = +1.013520800512156573576e-01 */ + .quad 0x3FBAD0B0DFCCA141 /* P5 = +1.047468706498238100104e-01 */ + .quad 0xBFABD1B9600E608E /* P6 = -5.433444306908184548967e-02 */ + .quad 0xBFA1CEBEAF07DB58 /* P7 = -3.478046309094534453598e-02 */ + .quad 0x3F9AFC9FB1D8EFD2 /* P8 = +2.635430834764902126383e-02 */ + .quad 0x3F8573444F1AB502 /* P9 = +1.047376028449287564018e-02 */ + .quad 0xBF8874FBC8F24406 /* P10 = -1.194187838544459322219e-02 */ + .quad 0xBFC5000000000000 /* B = -.164063 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7FB199D361A790 /* PL0 = +2.748994907060158996213e-17 */ + .quad 0x3FC6C170259E21F7 /* PH0 = +1.777782615356639783766e-01 */ + .quad 0x3FEEFD17479F7C65 /* P1 = +9.683948897253570478266e-01 */ + .quad 0xBFC609530FE4DF8D /* P2 = -1.721595599753950294577e-01 */ + .quad 0xBFD2B3465D71B4DE /* P3 = -2.921920692959484052676e-01 */ + .quad 0x3FBBFD2D34AC509B /* P4 = +1.093319181057403192166e-01 */ + .quad 0x3FB9778C3C16A0FE /* P5 = +9.948040453912551395183e-02 */ + .quad 0xBFADAC4D9E63C665 /* P6 = -5.795519407719210697372e-02 */ + .quad 0xBFA0139CCAD02D60 /* P7 = -3.139963126894929339124e-02 */ + .quad 0x3F9C5BF43BA6F19D /* P8 = +2.769452680671379432854e-02 */ + .quad 0x3F8190B703350341 /* P9 = +8.576803002712575184772e-03 */ + .quad 0xBF8936606782858A /* P10 = 
-1.231074634444230850234e-02 */ + .quad 0xBFC7000000000000 /* B = -.179688 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6A917CA3624D50 /* PL0 = +1.152216693509785660691e-17 */ + .quad 0x3FC8AFD7B974FABB /* PH0 = +1.928662925292508878439e-01 */ + .quad 0x3FEECF47624A5D03 /* P1 = +9.628025932060214187231e-01 */ + .quad 0xBFC7C4C2CB4FDE4D /* P2 = -1.856921665891938814679e-01 */ + .quad 0xBFD23F69CB2C1F9D /* P3 = -2.851204380135586155453e-01 */ + .quad 0x3FBDEC5703A03814 /* P4 = +1.168875106670557712458e-01 */ + .quad 0x3FB8095003D0CF15 /* P5 = +9.389209836154706616487e-02 */ + .quad 0xBFAF554B47B10CBB /* P6 = -6.119761705533607365968e-02 */ + .quad 0xBF9C89743FE7BC1B /* P7 = -2.786809577986213853937e-02 */ + .quad 0x3F9D74725B746E7C /* P8 = +2.876452143855921824991e-02 */ + .quad 0x3F7B2D8AFB70B88C /* P9 = +6.635229968237631511880e-03 */ + .quad 0xBF89A0A2883EF6CB /* P10 = -1.251341799058582545252e-02 */ + .quad 0xBFC9000000000000 /* B = -.195313 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7608279E8609CB /* PL0 = +1.910958764623660748269e-17 */ + .quad 0x3FCA9B46D2DDC5E3 /* PH0 = +2.078636674519166172015e-01 */ + .quad 0x3FEE9E0BB72A01A1 /* P1 = +9.567926957534390123919e-01 */ + .quad 0xBFC974FAD10C5330 /* P2 = -1.988824387305156976885e-01 */ + .quad 0xBFD1C40ACCBA4044 /* P3 = -2.775904654781735703430e-01 */ + .quad 0x3FBFBE24E2987853 /* P4 = +1.239951184474830487522e-01 */ + .quad 0x3FB6885B4345E47F /* P5 = +8.801813499839460539687e-02 */ + .quad 0xBFB06563D5670584 /* P6 = -6.404708824176991770896e-02 */ + .quad 0xBF98CD1D620DF6E2 /* P7 = -2.421995078065365147772e-02 */ + .quad 0x3F9E44EF3E844D21 /* P8 = +2.955983943054463683119e-02 */ + .quad 0x3F7325FA0148CAAE /* P9 = +4.674889165971292322643e-03 */ + .quad 0xBF89B4C8556C2D92 /* P10 = 
-1.255184660614964011319e-02 */ + .quad 0xBFCB000000000000 /* B = -.210938 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6F19DAA20F51D5 /* PL0 = +1.348790537832000351176e-17 */ + .quad 0x3FCC83876CA98E15 /* PH0 = +2.227639465883021474557e-01 */ + .quad 0x3FEE697B662D07CD /* P1 = +9.503762241004040620296e-01 */ + .quad 0xBFCB194C7ED76ACF /* P2 = -2.117095584242946953999e-01 */ + .quad 0xBFD141A19E419762 /* P3 = -2.696308179350720680191e-01 */ + .quad 0x3FC0B89C64BC7B98 /* P4 = +1.306338779331468503007e-01 */ + .quad 0x3FB4F721150BBFC5 /* P5 = +8.189589275184434216748e-02 */ + .quad 0xBFB105AAFAB87898 /* P6 = -6.649273511036069461061e-02 */ + .quad 0xBF94FB3B31248C01 /* P7 = -2.048962104266749732921e-02 */ + .quad 0x3F9ECD31E588709C /* P8 = +3.007963145692880855964e-02 */ + .quad 0x3F664A91A335C105 /* P9 = +2.721104095762541127495e-03 */ + .quad 0xBF89754E32E1E26E /* P10 = -1.243077366619723806134e-02 */ + .quad 0xBFCD000000000000 /* B = -.226563 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6AC6C889D8111D /* PL0 = +1.161245469312620769170e-17 */ + .quad 0x3FCE6864FE55A3D0 /* PH0 = +2.375608674877001114112e-01 */ + .quad 0x3FEE31AEE116B82B /* P1 = +9.435648342384913826391e-01 */ + .quad 0xBFCCB114B69E808B /* P2 = -2.241540805525839833707e-01 */ + .quad 0xBFD0B8AB913BA99D /* P3 = -2.612713735858507980441e-01 */ + .quad 0x3FC1823322BED48A /* P4 = +1.367858810096190233514e-01 */ + .quad 0x3FB35822B7929893 /* P5 = +7.556359273675842651653e-02 */ + .quad 0xBFB18B03CC78D2DA /* P6 = -6.852744810096158580830e-02 */ + .quad 0xBF911CCC3C8D5E5D /* P7 = -1.671141738492420009734e-02 */ + .quad 0x3F9F0DEC2D99B12F /* P8 = +3.032654789278515819797e-02 */ + .quad 0x3F4A28398B4EBD98 /* P9 = +7.982521989244205404918e-04 */ + .quad 0xBF88E60CB2FAB9A4 /* P10 = 
-1.215753480150000985458e-02 */ + .quad 0xBFCF000000000000 /* B = -.242188 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C89D2B6774FB61D /* PL0 = +4.479593208720169247958e-17 */ + .quad 0x3FD09C744F539BE4 /* PH0 = +2.595492148088267558848e-01 */ + .quad 0x3FEDD823B0400D42 /* P1 = +9.326342050921214825882e-01 */ + .quad 0xBFCEFBF7FF305FCC /* P2 = -2.420644756355144687086e-01 */ + .quad 0xBFCFC01DC4F24A41 /* P3 = -2.480504237797323303990e-01 */ + .quad 0x3FC291A2C26D5548 /* P4 = +1.450694512701977626753e-01 */ + .quad 0x3FB0D562E672D188 /* P5 = +6.575601698097532991976e-02 */ + .quad 0xBFB2201ECC119E06 /* P6 = -7.080261690281738261872e-02 */ + .quad 0xBF8695D50F778D31 /* P7 = -1.102796987010509974642e-02 */ + .quad 0x3F9EEC8CFBC031A0 /* P8 = +3.019924437107734972427e-02 */ + .quad 0xBF6030F0A4D3660A /* P9 = -1.976461417694923328722e-03 */ + .quad 0xBF87845288A4AEF5 /* P10 = -1.148285369398347838494e-02 */ + .quad 0xBFD1000000000000 /* B = -.265625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C8B6AAB614D1C8D /* PL0 = +4.756035418366735312727e-17 */ + .quad 0x3FD275F7E1CF7F63 /* PH0 = +2.884502129727392616410e-01 */ + .quad 0x3FED56658F74C9CC /* P1 = +9.167964746359813351341e-01 */ + .quad 0xBFD0ECC045EBD596 /* P2 = -2.644501383614054083635e-01 */ + .quad 0xBFCD5A4BDE179180 /* P3 = -2.293181261476426808811e-01 */ + .quad 0x3FC3C00047D34767 /* P4 = +1.542969084462655120552e-01 */ + .quad 0x3FAAC7CE84FD609F /* P5 = +5.230565427217581251974e-02 */ + .quad 0xBFB288948D2E8B43 /* P6 = -7.239654967137902384931e-02 */ + .quad 0xBF6D6605AAD5A1C0 /* P7 = -3.588687008847041164896e-03 */ + .quad 0x3F9DDB0790848E97 /* P8 = +2.915584392134337382866e-02 */ + .quad 0xBF75FDE291BAD5B4 /* P9 = -5.369076763306269573660e-03 */ + .quad 0xBF84CEA5C52E0A78 /* P10 = 
-1.015977390284671071888e-02 */ + .quad 0xBFD3000000000000 /* B = -.296875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7139A81C8A6ECF /* PL0 = +1.494049799478574591322e-17 */ + .quad 0x3FD4470650036407 /* PH0 = +3.168350011233659890841e-01 */ + .quad 0x3FECC9A69DFDDD48 /* P1 = +8.996155820631566629678e-01 */ + .quad 0xBFD23DED3A37A09F /* P2 = -2.850297039535778028925e-01 */ + .quad 0xBFCAD302395D51C1 /* P3 = -2.095644741153943890185e-01 */ + .quad 0x3FC4A8FE3F309C22 /* P4 = +1.614072617096278705115e-01 */ + .quad 0x3FA3D161188AA436 /* P5 = +3.870681213931741151586e-02 */ + .quad 0xBFB288CFE5494E98 /* P6 = -7.240008685885823969403e-02 */ + .quad 0x3F6C7903EED8D334 /* P7 = +3.475673371918475361081e-03 */ + .quad 0x3F9BE023CDFB02F6 /* P8 = +2.722221321778569498033e-02 */ + .quad 0xBF80F8296F2C3A95 /* P9 = -8.285831170295390358336e-03 */ + .quad 0xBF8152DF4790049B /* P10 = -8.458847400108650973189e-03 */ + .quad 0xBFD5000000000000 /* B = -.328125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C7751FE0FEE8335 /* PL0 = +2.022712113430213599928e-17 */ + .quad 0x3FD60EF7120502A9 /* PH0 = +3.446633983585721261456e-01 */ + .quad 0x3FEC32D951E56E6F /* P1 = +8.812071418319202070776e-01 */ + .quad 0xBFD370255FC004F8 /* P2 = -3.037198481616338996824e-01 */ + .quad 0xBFC832F0EBC6BB41 /* P3 = -1.890545989276351359107e-01 */ + .quad 0x3FC54C99A0FF432F /* P4 = +1.664001499289269127540e-01 */ + .quad 0x3F99DAC0CC283C18 /* P5 = +2.524853941036661688369e-02 */ + .quad 0xBFB227B3896A026D /* P6 = -7.091829399906553280461e-02 */ + .quad 0x3F84663364E1FB19 /* P7 = +9.960557476231411602383e-03 */ + .quad 0x3F9922D70DE07C57 /* P8 = +2.454696676442965935283e-02 */ + .quad 0xBF85C4A4EB6F86BC /* P9 = -1.062897532932837635222e-02 */ + .quad 0xBF7AAB61214FFE17 /* P10 = 
-6.511096396024671890972e-03 */ + .quad 0xBFD7000000000000 /* B = -.359375 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3BFE67F266843B2C /* PL0 = +1.030196791298162288777e-19 */ + .quad 0x3FD7CD3115FC0F16 /* PH0 = +3.718989100163850869407e-01 */ + .quad 0x3FEB92F96CCC2C5B /* P1 = +8.616912007286247079761e-01 */ + .quad 0xBFD4827320135092 /* P2 = -3.204620183216856200247e-01 */ + .quad 0xBFC582B15550168A /* P3 = -1.680509249273891977521e-01 */ + .quad 0x3FC5AC3B9A2E4C31 /* P4 = +1.693186285816366254244e-01 */ + .quad 0x3F88FA599FCADAFB /* P5 = +1.219625491044728129762e-02 */ + .quad 0xBFB16EC8F5CA169E /* P6 = -6.809669495313605642174e-02 */ + .quad 0x3F90140EFC748BBE /* P7 = +1.570151725639922719844e-02 */ + .quad 0x3F95CFC49C1A28DC /* P8 = +2.130038454792147768770e-02 */ + .quad 0xBF8946ED8B1BF454 /* P9 = -1.234231549050882816697e-02 */ + .quad 0xBF7239E55C1DD50F /* P10 = -4.449745117985472755606e-03 */ + .quad 0xBFD9000000000000 /* B = -.390625 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C6412330191189C /* PL0 = +8.704448096175471149661e-18 */ + .quad 0x3FD9812B3B03F0A5 /* PH0 = +3.985088421175169703936e-01 */ + .quad 0x3FEAEB08C3C0E84D /* P1 = +8.411907027541559254748e-01 */ + .quad 0xBFD57446B1BC46CF /* P2 = -3.352219329545790787820e-01 */ + .quad 0xBFC2CA9ABC0444AD /* P3 = -1.468079965639267634401e-01 */ + .quad 0x3FC5CA95F9460D18 /* P4 = +1.702449290424759093710e-01 */ + .quad 0xBF2C2DAA35DD05C3 /* P5 = -2.149839664813813012186e-04 */ + .quad 0xBFB069A516EEB75D /* P6 = -6.411201295733578195472e-02 */ + .quad 0x3F9512716416FDC7 /* P7 = +2.057816670798986720058e-02 */ + .quad 0x3F921630CB1319A3 /* P8 = +1.766277541607908852593e-02 */ + .quad 0xBF8B76DA2EC99526 /* P9 = -1.341028647693549562145e-02 */ + .quad 0xBF63A97474A161E4 /* P10 = 
-2.400138332671485493040e-03 */ + .quad 0xBFDB000000000000 /* B = -.421875 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C89B79F5783381C /* PL0 = +4.461236087774530799537e-17 */ + .quad 0x3FDB2A6C993B829D /* PH0 = +4.244643684778937609003e-01 */ + .quad 0x3FEA3C0C1FBA328C /* P1 = +8.198299998926627915155e-01 */ + .quad 0xBFD6457212F78DE0 /* P2 = -3.479886231636708581604e-01 */ + .quad 0xBFC0129BDA380A66 /* P3 = -1.255678954622282824818e-01 */ + .quad 0x3FC5AB77F388FBDE /* P4 = +1.692953051696965507089e-01 */ + .quad 0xBF8822F3A6CADB7C /* P5 = -1.178541519889874597783e-02 */ + .quad 0xBFAE4A876370A4BD /* P6 = -5.916236008517603590739e-02 */ + .quad 0x3F991A89BC3B7710 /* P7 = +2.451529704455085335710e-02 */ + .quad 0x3F8C4A4328204D4B /* P8 = +1.381351915555364098800e-02 */ + .quad 0xBF8C5F921D01EC0B /* P9 = -1.385416174911393178490e-02 */ + .quad 0xBF3EE844C5B79FB8 /* P10 = -4.716079617694784908234e-04 */ + .quad 0xBFDD000000000000 /* B = -.453125 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x3C73FA437AD7AD87 /* PL0 = +1.732779905745858845932e-17 */ + .quad 0x3FDCC88C9902CF45 /* PH0 = +4.497405523536495697279e-01 */ + .quad 0x3FE9870845162D1D /* P1 = +7.977334355686341748810e-01 */ + .quad 0xBFD6F62358F73DA8 /* P2 = -3.587730759436120677668e-01 */ + .quad 0xBFBAC4345D675FE1 /* P3 = -1.045563438450467661101e-01 */ + .quad 0x3FC5539DA8287019 /* P4 = +1.666142531474868131862e-01 */ + .quad 0xBF96E3E0DC04A09F /* P5 = -2.235366194614185212822e-02 */ + .quad 0xBFAB5EC7147C207D /* P6 = -5.345747113284546871398e-02 */ + .quad 0x3F9C24166FFA7A58 /* P7 = +2.748141344511120915667e-02 */ + .quad 0x3F8451B907819844 /* P8 = +9.921498815128277696693e-03 */ + .quad 0xBF8C1C6D19191FCB /* P9 = -1.372609360545586670239e-02 */ + .quad 0x3F547372DF72E35A /* P10 = 
+1.248228245272117756098e-03 */
+	.quad 0xBFDF000000000000 /* B = -.484375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C848FE06EE49950 /* PL0 = +3.566941590788961528958e-17 */
+	.quad 0x3FDF20211A36475D /* PH0 = +4.863360172249622803697e-01 */
+	.quad 0x3FE86E67E6B80AC2 /* P1 = +7.634772783497611574659e-01 */
+	.quad 0xBFD7C37C55474D9B /* P2 = -3.713064987943767913461e-01 */
+	.quad 0xBFB2EBF15F3CB036 /* P3 = -7.391270232318521952684e-02 */
+	.quad 0x3FC4718C8EF6E3AA /* P4 = +1.597152422016539530950e-01 */
+	.quad 0xBFA277F8394E9B07 /* P5 = -3.607154559658991932071e-02 */
+	.quad 0xBFA680312AB207E3 /* P6 = -4.394677778419955009224e-02 */
+	.quad 0x3F9EDC9A8B57E286 /* P7 = +3.013841128810892143223e-02 */
+	.quad 0x3F71B8C5E648EAF6 /* P8 = +4.326603932492947851719e-03 */
+	.quad 0xBF89DB218356730C /* P9 = -1.262499029217558458029e-02 */
+	.quad 0x3F6B05728E6EBC8E /* P10 = +3.298496001171330815865e-03 */
+	.quad 0xBFE1000000000000 /* B = -.53125 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8429831EDD94DE /* PL0 = +3.497576705878673192147e-17 */
+	.quad 0x3FE10AF47E0BF610 /* PH0 = +5.325872861719194162333e-01 */
+	.quad 0x3FE6EC5879F87EEE /* P1 = +7.163507826080299761242e-01 */
+	.quad 0xBFD86AD001BFE200 /* P2 = -3.815193192563413204129e-01 */
+	.quad 0xBFA239045B661385 /* P3 = -3.559125533778398983564e-02 */
+	.quad 0x3FC2B4572D9CC147 /* P4 = +1.461285565105845078038e-01 */
+	.quad 0xBFA99F4F01740705 /* P5 = -5.004355328311586406115e-02 */
+	.quad 0xBF9F449C484F4879 /* P6 = -3.053516570418721511214e-02 */
+	.quad 0x3F9F5F42169D7DDE /* P7 = +3.063681853325116830798e-02 */
+	.quad 0xBF6111B1BA632A97 /* P8 = -2.083632588527460989469e-03 */
+	.quad 0xBF84725FBE5B6E61 /* P9 = -9.983776089419639342530e-03 */
+	.quad 0x3F7438A2986CFA9C /* P10 = +4.936823976832951342488e-03 */
+	.quad 0xBFE3000000000000 /* B = -.59375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C6BE9160BFB3505 /* PL0 = +1.210424670976053242391e-17 */
+	.quad 0x3FE26D76F73233C7 /* PH0 = +5.758623912857893101247e-01 */
+	.quad 0x3FE56363B5B93937 /* P1 = +6.683825063026124740752e-01 */
+	.quad 0xBFD8A2244B27297E /* P2 = -3.848963483730115724200e-01 */
+	.quad 0xBF52CA2F101EEF63 /* P3 = -1.146837196286797844817e-03 */
+	.quad 0x3FC081BC342243AD /* P4 = +1.289592032012739958675e-01 */
+	.quad 0xBFAE38DB4A932344 /* P5 = -5.902753148399722719732e-02 */
+	.quad 0xBF91F814D4AE90C6 /* P6 = -1.754791782481459457885e-02 */
+	.quad 0x3F9D056AE193C4F3 /* P7 = +2.834097863973723355792e-02 */
+	.quad 0xBF7BD0B502D8F3A0 /* P8 = -6.790835451792626336974e-03 */
+	.quad 0xBF7B763F7BB8AE2F /* P9 = -6.704566938008179114124e-03 */
+	.quad 0x3F76036F42D9AB69 /* P10 = +5.374369252971835729099e-03 */
+	.quad 0xBFE5000000000000 /* B = -.65625 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8B64AF0450486E /* PL0 = +4.751979286662385162741e-17 */
+	.quad 0x3FE3B75F8BCB742D /* PH0 = +6.161344271055263499548e-01 */
+	.quad 0x3FE3DA23BC12369F /* P1 = +6.203783677353447780947e-01 */
+	.quad 0xBFD8768FF4B46416 /* P2 = -3.822364701932782367281e-01 */
+	.quad 0x3F9D67CB8AD9CB1A /* P3 = +2.871625933625941117406e-02 */
+	.quad 0x3FBC168CB7827DF4 /* P4 = +1.097190807363331305006e-01 */
+	.quad 0xBFB03A2B83C9272E /* P5 = -6.338760344911228324430e-02 */
+	.quad 0xBF789FEB595297DC /* P6 = -6.011885959344067548074e-03 */
+	.quad 0x3F98BD01B4C335E7 /* P7 = +2.415850320612902513532e-02 */
+	.quad 0xBF83BADC303D6535 /* P8 = -9.633751127398152979976e-03 */
+	.quad 0xBF6C54E7A1C1E3F3 /* P9 = -3.458454519258407989501e-03 */
+	.quad 0x3F7408394B7EF3E7 /* P10 = +4.890655334688332484537e-03 */
+	.quad 0xBFE7000000000000 /* B = -.71875 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C6A48557F6E0D3E /* PL0 = +1.139824111505584215867e-17 */
+	.quad 0x3FE4E8D895B010DC /* PH0 = +6.534235881413468227663e-01 */
+	.quad 0x3FE25652FAAF8A73 /* P1 = +5.730376144604875448991e-01 */
+	.quad 0xBFD7F6C3A57C444B /* P2 = -3.744362941807295084434e-01 */
+	.quad 0x3FAB7866E3F99EBE /* P3 = +5.365296872042567001598e-02 */
+	.quad 0x3FB6FA1DF47CCD40 /* P4 = +8.975398272450707099784e-02 */
+	.quad 0xBFB05508D3741B8E /* P5 = -6.379752314033580026840e-02 */
+	.quad 0x3F6C3EFDF7BB279C /* P6 = +3.448005705512137236209e-03 */
+	.quad 0x3F9372BADD6D3E27 /* P7 = +1.899234749299530050806e-02 */
+	.quad 0xBF860FD5AE65F3DA /* P8 = -1.077238977881649471165e-02 */
+	.quad 0xBF47266FFB07E628 /* P9 = -7.064863949032872448118e-04 */
+	.quad 0x3F6F9763992C2A05 /* P10 = +3.856367614735181120799e-03 */
+	.quad 0xBFE9000000000000 /* B = -.78125 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C6BB6A2B194E3AB /* PL0 = +1.201878007209462528697e-17 */
+	.quad 0x3FE602609AAE7C22 /* PH0 = +6.877902051090851731630e-01 */
+	.quad 0x3FE0DCBAFE191C7F /* P1 = +5.269446337560025312137e-01 */
+	.quad 0xBFD732028428A9FB /* P2 = -3.624273577321727538225e-01 */
+	.quad 0x3FB2D92389BE065B /* P3 = +7.362577545975439796588e-02 */
+	.quad 0x3FB1F6A9C8C49993 /* P4 = +7.017003203927733370937e-02 */
+	.quad 0xBFAF47C0B50B56EE /* P5 = -6.109430513394707378526e-02 */
+	.quad 0x3F85A8EDD1356223 /* P6 = +1.057611269668352068104e-02 */
+	.quad 0x3F8BE05C5CD1B4FA /* P7 = +1.361152799855823798207e-02 */
+	.quad 0xBF85A0EFE4552F76 /* P8 = -1.056086936537046752272e-02 */
+	.quad 0x3F559F2A6A356194 /* P9 = +1.319686337259627831943e-03 */
+	.quad 0x3F6576F5E989208D /* P10 = +2.620201394425042596201e-03 */
+	.quad 0xBFEB000000000000 /* B = -.84375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C80328BD86C8B74 /* PL0 = +2.809809047161267929701e-17 */
+	.quad 0x3FE704BB1B7FCB81 /* PH0 = +7.193275010198335595035e-01 */
+	.quad 0x3FDEE264AAD6C40C /* P1 = +4.825679462765613089739e-01 */
+	.quad 0xBFD637493CE659F1 /* P2 = -3.471243948673921548357e-01 */
+	.quad 0x3FB6BE3A3DEE6F4A /* P3 = +8.884014141079635303208e-02 */
+	.quad 0x3FAA85EB6470AC0F /* P4 = +5.180297471118688523488e-02 */
+	.quad 0xBFACC0146EA4858D /* P5 = -5.615295267694895314457e-02 */
+	.quad 0x3F8F8FB683CDDAC5 /* P6 = +1.541082944616557159055e-02 */
+	.quad 0x3F819515DEE2CB91 /* P7 = +8.585139145315585602547e-03 */
+	.quad 0xBF834E45E6AF9EA1 /* P8 = -9.426637747267209169415e-03 */
+	.quad 0x3F65250F197CA56D /* P9 = +2.581147662472352252568e-03 */
+	.quad 0x3F57A766026D036C /* P10 = +1.443719500187702367690e-03 */
+	.quad 0xBFED000000000000 /* B = -.90625 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C716F7EEF7B61AD /* PL0 = +1.512291215142578135651e-17 */
+	.quad 0x3FE7F0E1A4CD846E /* PH0 = +7.481544703297353660076e-01 */
+	.quad 0x3FDC2D4CC872DC09 /* P1 = +4.402648885256331012598e-01 */
+	.quad 0xBFD514A99F92ED53 /* P2 = -3.293861444796750250530e-01 */
+	.quad 0x3FB9846A6CF2F337 /* P3 = +9.967675361526749494844e-02 */
+	.quad 0x3FA20896939AB161 /* P4 = +3.522177268800664413493e-02 */
+	.quad 0xBFA97E801F31EE0D /* P5 = -4.979324703978358553405e-02 */
+	.quad 0x3F92A11F47B82085 /* P6 = +1.819275737037219740638e-02 */
+	.quad 0x3F717D70FE289C34 /* P7 = +4.270020845559097605514e-03 */
+	.quad 0xBF7FDCF1D3F6CE2D /* P8 = -7.779068604054678540132e-03 */
+	.quad 0x3F69F607E81AF6B6 /* P9 = +3.169074480722534625181e-03 */
+	.quad 0x3F3F925C80D0F889 /* P10 = +4.817462766516585511824e-04 */
+	.quad 0xBFEF000000000000 /* B = -.96875 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C931A11D7E8606E /* PL0 = +6.627280241435322692188e-17 */
+	.quad 0x3FE92BFB370D9B71 /* PH0 = +7.866188121086975515439e-01 */
+	.quad 0x3FD866160E454111 /* P1 = +3.812308444367014680480e-01 */
+	.quad 0xBFD33149F3801DBA /* P2 = -2.998833539899937679796e-01 */
+	.quad 0x3FBBDB6D4C949899 /* P3 = +1.088169395412442909023e-01 */
+	.quad 0x3F8D6AB2A74B9343 /* P4 = +1.436366627735597372494e-02 */
+	.quad 0xBFA404D1047C5D72 /* P5 = -3.909924678571997970917e-02 */
+	.quad 0x3F93C47D9ACCD919 /* P6 = +1.930423981976856424661e-02 */
+	.quad 0xBF41B755642CFF1B /* P7 = -5.406538915408738478158e-04 */
+	.quad 0xBF74B5301AA1E788 /* P8 = -5.055606752756853900641e-03 */
+	.quad 0x3F69A84C5B2A3E68 /* P9 = +3.132008679422249529120e-03 */
+	.quad 0xBF3CF47830328C11 /* P10 = -4.418176105877589308931e-04 */
+	.quad 0xBFF1000000000000 /* B = -1.0625 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C884D471B8FD396 /* PL0 = +4.215701792312937090514e-17 */
+	.quad 0x3FEA8DBCBC31897A /* PH0 = +8.298019099859594849278e-01 */
+	.quad 0x3FD3EE730537C8EA /* P1 = +3.114287901836535219818e-01 */
+	.quad 0xBFD08A05AD27CE32 /* P2 = -2.584242049190123217982e-01 */
+	.quad 0x3FBC5255406F84B6 /* P3 = +1.106313021005175045399e-01 */
+	.quad 0xBF772FA2F633AA5E /* P4 = -5.660664147607434209241e-03 */
+	.quad 0xBF99DD8E4C473FC4 /* P5 = -2.525923100057504533247e-02 */
+	.quad 0x3F9183C935B6495D /* P6 = +1.710428610165003372069e-02 */
+	.quad 0xBF70471A3A591480 /* P7 = -3.974058583087303228038e-03 */
+	.quad 0xBF603DDD4DEBB9A4 /* P8 = -1.982624278176818987264e-03 */
+	.quad 0x3F62591E44D3C17F /* P9 = +2.239760512218135956425e-03 */
+	.quad 0xBF4C195D3A9B1AB4 /* P10 = -8.575158328419569430544e-04 */
+	.quad 0xBFF3000000000000 /* B = -1.1875 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C90DD1C9BFF7F64 /* PL0 = +5.850777430004479798187e-17 */
+	.quad 0x3FEBAD50A4A68BC1 /* PH0 = +8.649066177207417327466e-01 */
+	.quad 0x3FD01FBA72CEE1A5 /* P1 = +2.519365426228666233893e-01 */
+	.quad 0xBFCBE432F647C4D6 /* P2 = -2.179015829602010702633e-01 */
+	.quad 0x3FBABF92B6E5AC73 /* P3 = +1.044856735731387955105e-01 */
+	.quad 0xBF922983AA24E217 /* P4 = -1.773648954369563555378e-02 */
+	.quad 0xBF8C72214C14E23A /* P5 = -1.388956082756564056328e-02 */
+	.quad 0x3F8ACB4D1F388E8B /* P6 = +1.308307887581540972153e-02 */
+	.quad 0xBF740EF8B4A2EE3B /* P7 = -4.897090441029978580995e-03 */
+	.quad 0xBF0EA9F30C8DC900 /* P8 = -5.848668076326342477133e-05 */
+	.quad 0x3F53CC40D18713AE /* P9 = +1.208365725788622757410e-03 */
+	.quad 0xBF4848B86029CBA1 /* P10 = -7.410908004444779592485e-04 */
+	.quad 0xBFF5000000000000 /* B = -1.3125 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8FB61781D22681 /* PL0 = +5.501032995458057064843e-17 */
+	.quad 0x3FEC950A3340C8BF /* PH0 = +8.931933404003514764824e-01 */
+	.quad 0x3FC9E1DFFD385423 /* P1 = +2.022056566644617586005e-01 */
+	.quad 0xBFC71E2FF88EBA23 /* P2 = -1.806087459239772032583e-01 */
+	.quad 0x3FB80AEBD07AB5BA /* P3 = +9.391664352252506838449e-02 */
+	.quad 0xBF98404E27EAE6ED /* P4 = -2.368280523908243895884e-02 */
+	.quad 0xBF772DA520B5006E /* P5 = -5.658764868087568802107e-03 */
+	.quad 0x3F824C9268AF9423 /* P6 = +8.935111827620250551925e-03 */
+	.quad 0xBF722AE76D206AE3 /* P7 = -4.435447701349490160113e-03 */
+	.quad 0x3F4B807F56298D5E /* P8 = +8.392926941493230644497e-04 */
+	.quad 0x3F3D71027DF95D2A /* P9 = +4.492407879061627603159e-04 */
+	.quad 0xBF3EBD17676755FB /* P10 = -4.690343988874298905483e-04 */
+	.quad 0xBFF7000000000000 /* B = -1.4375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C95393C63CE8224 /* PL0 = +7.363407705201031038415e-17 */
+	.quad 0x3FED4E6F464286B0 /* PH0 = +9.158245441687622445670e-01 */
+	.quad 0x3FC4A45842B7DE1E /* P1 = +1.612654042980787191461e-01 */
+	.quad 0xBFC2E7885AFDD3D0 /* P2 = -1.476908153814791087327e-01 */
+	.quad 0x3FB4DD6DD51D3FEB /* P3 = +8.150373890862254580204e-02 */
+	.quad 0xBF9A05D3ADAB489C /* P4 = -2.541285274021075503042e-02 */
+	.quad 0xBF3459B643B4995C /* P5 = -3.105230313899165257622e-04 */
+	.quad 0x3F766B30745F2E3A /* P6 = +5.473317409222350365811e-03 */
+	.quad 0xBF6C2C891E555BDF /* P7 = -3.439204988051155730940e-03 */
+	.quad 0x3F5194F30D6C576D /* P8 = +1.073109966176012791522e-03 */
+	.quad 0x3EF4DBB43C3132A2 /* P9 = +1.989194766975849961365e-05 */
+	.quad 0xBF2E45EBAB3C15A0 /* P10 = -2.309656316514087783666e-04 */
+	.quad 0xBFF9000000000000 /* B = -1.5625 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C75111669651DAA /* PL0 = +1.827249135453834384396e-17 */
+	.quad 0x3FEDE1EB5937518F /* PH0 = +9.338280432225917193634e-01 */
+	.quad 0x3FC06129C7C8EBB1 /* P1 = +1.279651856910653382507e-01 */
+	.quad 0xBFBE9763041064E1 /* P2 = -1.194974789545031421774e-01 */
+	.quad 0x3FB1A5B9F9113928 /* P3 = +6.893503504509068635308e-02 */
+	.quad 0xBF992145039F9AFE /* P4 = -2.454097590080105816526e-02 */
+	.quad 0x3F66CB116EA49C89 /* P5 = +2.782377288116648315142e-03 */
+	.quad 0x3F67F972FDF30001 /* P6 = +2.926563829163342740100e-03 */
+	.quad 0xBF63A7B5975F02F3 /* P7 = -2.399305983061922438601e-03 */
+	.quad 0x3F4FDE7B8777F4C8 /* P8 = +9.725669069095216373599e-04 */
+	.quad 0xBF25918876626BA4 /* P9 = -1.645545082212515656240e-04 */
+	.quad 0xBF1495123C991F00 /* P10 = -7.851527984669912693674e-05 */
+	.quad 0xBFFB000000000000 /* B = -1.6875 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C9F29A5B7426D27 /* PL0 = +1.081172820484012446345e-16 */
+	.quad 0x3FEE56B6F3EFABFC /* PH0 = +9.480852856044061915952e-01 */
+	.quad 0x3FB9E3EFD94BB9FC /* P1 = +1.011342912204113371518e-01 */
+	.quad 0xBFB88BD9760FECA7 /* P2 = -9.588393337610288420285e-02 */
+	.quad 0x3FAD48A0350B3ACF /* P3 = +5.719471595295077387313e-02 */
+	.quad 0xBF96CC6A5110F129 /* P4 = -2.226415748394675367257e-02 */
+	.quad 0x3F71934687170384 /* P5 = +4.290843485649345772606e-03 */
+	.quad 0x3F5407BAF73B3DF9 /* P6 = +1.222546180475235334287e-03 */
+	.quad 0xBF591B626C0646DD /* P7 = -1.532407870488964407324e-03 */
+	.quad 0x3F48B0E1DD283558 /* P8 = +7.535078860329375669277e-04 */
+	.quad 0xBF2B322292840D2B /* P9 = -2.074877932117605962646e-04 */
+	.quad 0xBE99E4061120C741 /* P10 = -3.858017559892704559672e-07 */
+	.quad 0xBFFD000000000000 /* B = -1.8125 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C6AF8C2041C67CD /* PL0 = +1.169711482626385762338e-17 */
+	.quad 0x3FEEB2DFEDD5EC93 /* PH0 = +9.593352933146824801369e-01 */
+	.quad 0x3FB465A205CFB638 /* P1 = +7.967579500083210999681e-02 */
+	.quad 0xBFB3914BF68D39FF /* P2 = -7.643580216720378576778e-02 */
+	.quad 0x3FA7F21A08C5C734 /* P3 = +4.676896435820623621673e-02 */
+	.quad 0xBF93DA9560EA9960 /* P4 = -1.938851741820124550772e-02 */
+	.quad 0x3F73953FEC62820E /* P5 = +4.781007481284861359820e-03 */
+	.quad 0x3F2749D5E1273E3C /* P6 = +1.776765426044646108071e-04 */
+	.quad 0xBF4D46B0B498CE5A /* P7 = -8.934367007839658352859e-04 */
+	.quad 0x3F4153D680E1F4C4 /* P8 = +5.287930851093571206574e-04 */
+	.quad 0xBF28477014ECA6A2 /* P9 = -1.852344816708944640949e-04 */
+	.quad 0x3EFFAC54E07CEB4B /* P10 = +3.020588886147182143902e-05 */
+	.quad 0xBFFF000000000000 /* B = -1.9375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C7A8AF2BB2231F2 /* PL0 = +2.302217989249372577466e-17 */
+	.quad 0x3FEF1994DF724FC8 /* PH0 = +9.718727459135090285258e-01 */
+	.quad 0x3FAC65B1BC0C9D58 /* P1 = +5.546336575053583942603e-02 */
+	.quad 0xBFAB9937BDA747C8 /* P2 = -5.390333356957871365599e-02 */
+	.quad 0x3FA15B42D9EF931C /* P3 = +3.389939222669210777241e-02 */
+	.quad 0xBF8EACD8E8507A3C /* P4 = -1.497811755149058215502e-02 */
+	.quad 0x3F7263A15721C682 /* P5 = +4.489546046998806349050e-03 */
+	.quad 0xBF42A032ACDC3B32 /* P6 = -5.684134900735048121829e-04 */
+	.quad 0xBF3431E79B5AD185 /* P7 = -3.081503340170088810438e-04 */
+	.quad 0x3F31B51667C7DF5E /* P8 = +2.701930714290502424828e-04 */
+	.quad 0xBF1F8709579250AD /* P9 = -1.202678157759563704341e-04 */
+	.quad 0x3F01ED8ED1BF9595 /* P10 = +3.419487094883790833778e-05 */
+	.quad 0xC001000000000000 /* B = -2.125 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C86F3F7C3DAFC55 /* PL0 = +3.981710680748877459333e-17 */
+	.quad 0x3FEF73776B2AA2DB /* PH0 = +9.828450291725759901951e-01 */
+	.quad 0x3FA16A7FC4D7B900 /* P1 = +3.401564863075812007064e-02 */
+	.quad 0xBFA11E03803AD621 /* P2 = -3.343211117082156940532e-02 */
+	.quad 0x3F9609591597297F /* P3 = +2.152003473546803654658e-02 */
+	.quad 0xBF847E74ED9BBB0C /* P4 = -1.000682211039596246436e-02 */
+	.quad 0x3F6BFF771725CD65 /* P5 = +3.417713736035987187864e-03 */
+	.quad 0xBF491D1FF73C18FA /* P6 = -7.664114077392807421000e-04 */
+	.quad 0x3EF53EE467B51DC5 /* P7 = +2.026145237479599375099e-05 */
+	.quad 0x3F160135BE0D94A0 /* P8 = +8.394136922403255700685e-05 */
+	.quad 0xBF0B32CB1D276A40 /* P9 = -5.187685350778849443841e-05 */
+	.quad 0x3EF4DAF70C12D555 /* P10 = +1.988919462255396826584e-05 */
+	.quad 0xC003000000000000 /* B = -2.375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C19DBF4E2E5B7DC /* PL0 = +3.504575836708380670219e-19 */
+	.quad 0x3FEFAA7934B75EBD /* PH0 = +9.895597486128832054320e-01 */
+	.quad 0x3F9545200830A42C /* P1 = +2.077150392520736492125e-02 */
+	.quad 0xBF950C46D285F6BC /* P2 = -2.055464420253970271376e-02 */
+	.quad 0x3F8B79F5BFC6513F /* P3 = +1.341621390819425058164e-02 */
+	.quad 0xBF7A50ADAD777898 /* P4 = -6.424597194806612772505e-03 */
+	.quad 0x3F633A19BE8255E3 /* P5 = +2.347040444940816227383e-03 */
+	.quad 0xBF44E609BC2557B7 /* P6 = -6.377742322836087134324e-04 */
+	.quad 0x3F1AFCBAD60EAACD /* P7 = +1.029480968230231421206e-04 */
+	.quad 0x3EE80476AC34A8EF /* P8 = +1.145240583485084317660e-05 */
+	.quad 0xBEF278E23DE463E9 /* P9 = -1.761646478213091821804e-05 */
+	.quad 0x3EE209FAF377264D /* P10 = +8.601658563106529694651e-06 */
+	.quad 0xC005000000000000 /* B = -2.625 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C979D62702C631C /* PL0 = +8.193023793215066385979e-17 */
+	.quad 0x3FEFCC04CDBCDC4B /* PH0 = +9.936546343150295390600e-01 */
+	.quad 0x3F89E87D088D269A /* P1 = +1.265046770426474576547e-02 */
+	.quad 0xBF89BE6721012B80 /* P2 = -1.257019586059526836624e-02 */
+	.quad 0x3F80F1C13E8D39D3 /* P3 = +8.273610803056031004326e-03 */
+	.quad 0xBF7082DBC9602757 /* P4 = -4.031046430108839563004e-03 */
+	.quad 0x3F590BE9BD4E0A11 /* P5 = +1.528719197467002507978e-03 */
+	.quad 0xBF3DCC2BEF6D0283 /* P6 = -4.546744598208711809986e-04 */
+	.quad 0x3F1A08065C4A8E85 /* P7 = +9.930170842636406837764e-05 */
+	.quad 0xBEE528117D0410F3 /* P8 = -1.008821337267942266431e-05 */
+	.quad 0xBED0BE73A44FF565 /* P9 = -3.992069257383521775961e-06 */
+	.quad 0x3EC9B0C11E342E38 /* P10 = +3.062539904901699218737e-06 */
+	.quad 0xC007000000000000 /* B = -2.875 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C804B931AD7A3CC /* PL0 = +2.826768921701616830245e-17 */
+	.quad 0x3FEFE06EB0688212 /* PH0 = +9.961465306733450209009e-01 */
+	.quad 0x3F7F81BD8876224D /* P1 = +7.692089427458426472642e-03 */
+	.quad 0xBF7F62A8C699A963 /* P2 = -7.662448196791823756776e-03 */
+	.quad 0x3F74C31E2B2A6A28 /* P3 = +5.068891378551522166321e-03 */
+	.quad 0xBF6470D537F16227 /* P4 = -2.495209162173734080001e-03 */
+	.quad 0x3F4FAEEF61C89673 /* P5 = +9.668988091717359455754e-04 */
+	.quad 0xBF33C5E80B349783 /* P6 = -3.017131341088651514023e-04 */
+	.quad 0x3F138F3D31037A6B /* P7 = +7.461367590931028650557e-05 */
+	.quad 0xBEEB3C780996FFE3 /* P8 = -1.298723536791163711556e-05 */
+	.quad 0x3E9D0C75BC8BFEFC /* P9 = +4.328589367358221917138e-07 */
+	.quad 0x3EAC3865227764D4 /* P10 = +8.410302755848104487452e-07 */
+	.quad 0xC009000000000000 /* B = -3.125 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C5B978B202749F9 /* PL0 = +5.983054034451594408315e-18 */
+	.quad 0x3FEFECD6B7EA3128 /* PH0 = +9.976609794698889643882e-01 */
+	.quad 0x3F73238B786137FE /* P1 = +4.672570043181776968058e-03 */
+	.quad 0xBF731815ACEA072E /* P2 = -4.661640805922390930706e-03 */
+	.quad 0x3F6956F0816D5AEE /* P3 = +3.093213784647877798933e-03 */
+	.quad 0xBF591A16286C4885 /* P4 = -1.532098425461232453877e-03 */
+	.quad 0x3F43B3E3A00C6096 /* P5 = +6.012784434430592468442e-04 */
+	.quad 0xBF29441B2A56DEC7 /* P6 = -1.927645836710038499293e-04 */
+	.quad 0x3F0A99C3A2E857B6 /* P7 = +5.073669705184196724674e-05 */
+	.quad 0xBEE61CB034DDC151 /* P8 = -1.054385361573597042258e-05 */
+	.quad 0x3EB792BBC76D6107 /* P9 = +1.405070887824641788698e-06 */
+	.quad 0x3E761472362A16F0 /* P10 = +8.225391704739515383837e-08 */
+	.quad 0xC00B000000000000 /* B = -3.375 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C9C290AFCBDE00D /* PL0 = +9.770074992945060684926e-17 */
+	.quad 0x3FEFF45F6D36133A /* PH0 = +9.985806592017987259879e-01 */
+	.quad 0x3F673CEC093032DE /* P1 = +2.836667068100913999228e-03 */
+	.quad 0xBF67347A7CD844D5 /* P2 = -2.832640870800243808078e-03 */
+	.quad 0x3F5EDA25530355DB /* P3 = +1.883064698679040793627e-03 */
+	.quad 0xBF4EAD3BBABC1BA9 /* P4 = -9.361783645268534848806e-04 */
+	.quad 0x3F3842E61CD35432 /* P5 = +3.701984213198588740338e-04 */
+	.quad 0xBF1F9AB7FD1A3DDD /* P6 = -1.205611036090218544867e-04 */
+	.quad 0x3F0136C154EA3DED /* P7 = +3.283288480304320224929e-05 */
+	.quad 0xBEDF12807F721E66 /* P8 = -7.408207230892235753013e-06 */
+	.quad 0x3EB5B53687AD5112 /* P9 = +1.293889481520047941659e-06 */
+	.quad 0xBE801E90FBFED147 /* P10 = -1.200988872775447204019e-07 */
+	.quad 0xC00D000000000000 /* B = -3.625 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C9E323294294877 /* PL0 = +1.047637125334028950603e-16 */
+	.quad 0x3FEFF8F21CDAAA62 /* PH0 = +9.991388858373506653976e-01 */
+	.quad 0x3F5C3470628813F2 /* P1 = +1.721486807697344658108e-03 */
+	.quad 0xBF5C2E38AC6FF8D2 /* P2 = -1.720004411026422324849e-03 */
+	.quad 0x3F52C13234626F43 /* P3 = +1.144694354969070234454e-03 */
+	.quad 0xBF42B0A47DF47BB4 /* P4 = -5.703738387728891173354e-04 */
+	.quad 0x3F2DB2889E32FBFD /* P5 = +2.265731592156760387344e-04 */
+	.quad 0xBF1385FBD54C5A55 /* P6 = -7.447576110695385196414e-05 */
+	.quad 0x3EF5AFA812C6984E /* P7 = +2.068153223579892541184e-05 */
+	.quad 0xBED47097C188A03C /* P8 = -4.873231795467276043290e-06 */
+	.quad 0x3EAFF2B982F7EE8C /* P9 = +9.521288628073486288914e-07 */
+	.quad 0xBE828EC5B57D424D /* P10 = -1.382656715739529384702e-07 */
+	.quad 0xC00F000000000000 /* B = -3.875 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C9BA40DA6983BEC /* PL0 = +9.589840482158163453169e-17 */
+	.quad 0x3FEFFCAAC3F20E65 /* PH0 = +9.995931460438894911036e-01 */
+	.quad 0x3F4AA87CF664754C /* P1 = +8.135423820793490331956e-04 */
+	.quad 0xBF4AA5B62919E224 /* P2 = -8.132113891426467676310e-04 */
+	.quad 0x3F41C01B53B0B312 /* P3 = +5.416997368051531710388e-04 */
+	.quad 0xBF31B8B54D091751 /* P4 = -2.704088811110632606347e-04 */
+	.quad 0x3F1C431305954ECC /* P5 = +1.078110084525254933728e-04 */
+	.quad 0xBF02B7DEAD0D44E6 /* P6 = -3.570221236393906131126e-05 */
+	.quad 0x3EE51C6EFF109EA9 /* P7 = +1.006654199116272154479e-05 */
+	.quad 0xBEC48CFB08072D17 /* P8 = -2.449834994621594976610e-06 */
+	.quad 0x3EA1585EC59CAE34 /* P9 = +5.169271261920604503617e-07 */
+	.quad 0xBE78832BAF950BA9 /* P10 = -9.131575131209528255629e-08 */
+	.quad 0xC011000000000000 /* B = -4.25 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8FBF237F4AFE10 /* PL0 = +5.507163370275307643966e-17 */
+	.quad 0x3FEFFEC61279A3A4 /* PH0 = +9.998503075449787225182e-01 */
+	.quad 0x3F339E78281A00EA /* P1 = +2.993625022114214863645e-04 */
+	.quad 0xBF339DB7B072AD62 /* P2 = -2.993176899035080028902e-04 */
+	.quad 0x3F2A259E658EF4E4 /* P3 = +1.994853835451177669594e-04 */
+	.quad 0xBF1A219C312B10BA /* P4 = -9.968295880030927192162e-05 */
+	.quad 0x3F04E146B4F5F4B7 /* P5 = +3.982541113154699160876e-05 */
+	.quad 0xBEEBC5F137088210 /* P6 = -1.324329943580649487333e-05 */
+	.quad 0x3ECF96736E300B00 /* P7 = +3.765547135882256916132e-06 */
+	.quad 0xBEAF4874840B91EB /* P8 = -9.323068824421825762292e-07 */
+	.quad 0x3E8B6AB2B5C8FD3F /* P9 = +2.042709991312793245971e-07 */
+	.quad 0xBE650BCCE62FD2B7 /* P10 = -3.920140725219944650830e-08 */
+	.quad 0xC013000000000000 /* B = -4.75 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C9C869C85471703 /* PL0 = +9.896883942603146946483e-17 */
+	.quad 0x3FEFFF8C81C6DC33 /* PH0 = +9.999449286177707341139e-01 */
+	.quad 0x3F1CDF5A2E4D7C69 /* P1 = +1.101397316012206760643e-04 */
+	.quad 0xBF1CDEF1F9BE63BE /* P2 = -1.101336660539594564027e-04 */
+	.quad 0x3F133EC10C83AAA0 /* P3 = +7.341435696487731017506e-05 */
+	.quad 0xBF033DAB325FAACB /* P4 = -3.669909192168459445238e-05 */
+	.quad 0x3EEEC598FA98BAD8 /* P5 = +1.467316890843338172161e-05 */
+	.quad 0xBED47F1A15BA368E /* P6 = -4.886744445221253126882e-06 */
+	.quad 0x3EB761FBE7D201C1 /* P7 = +1.393720509029845064726e-06 */
+	.quad 0xBE974CD75A43BF6B /* P8 = -3.471994551992448536007e-07 */
+	.quad 0x3E74B02965BBF8DC /* P9 = +7.706929621914905669946e-08 */
+	.quad 0xBE504EF4E3892A66 /* P10 = -1.518840362012570189110e-08 */
+	.quad 0xC015000000000000 /* B = -5.25 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C643810400471B0 /* PL0 = +8.768592603904887599187e-18 */
+	.quad 0x3FEFFFD583014825 /* PH0 = +9.999797400180382433987e-01 */
+	.quad 0x3F053E71416C43CA /* P1 = +4.051955345663706869871e-05 */
+	.quad 0xBF053E550C7C8CC9 /* P2 = -4.051873253121394012080e-05 */
+	.quad 0x3EFC52D0D90D4843 /* P3 = +2.701139380018752534477e-05 */
+	.quad 0xBEEC523A6ADBE142 /* P4 = -1.350460237457883558350e-05 */
+	.quad 0x3ED6A73E22D844B3 /* P5 = +5.400965660055565196396e-06 */
+	.quad 0xBEBE31D10F23ACD0 /* P6 = -1.799738182979224868919e-06 */
+	.quad 0x3EA13E14264DEAB2 /* P7 = +5.138663935333241981438e-07 */
+	.quad 0xBE81385ABB98EDCC /* P8 = -1.282999997786486835638e-07 */
+	.quad 0x3E5EB9164593E0B6 /* P9 = +2.861301981891537161158e-08 */
+	.quad 0xBE387218CFE7772E /* P10 = -5.691705994073124478195e-09 */
+	.quad 0xC017000000000000 /* B = -5.75 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C92530433F4C703 /* PL0 = +6.357512739163799046861e-17 */
+	.quad 0x3FEFFFF05E8D3191 /* PH0 = +9.999925467214315633058e-01 */
+	.quad 0x3EEF42DDFA52B575 /* P1 = +1.490650158538873335176e-05 */
+	.quad 0xBEEF42CEB54212AA /* P2 = -1.490639048307961378200e-05 */
+	.quad 0x3EE4D7201CBCB853 /* P3 = +9.937445518550804010127e-06 */
+	.quad 0xBED4D6F764B66C37 /* P4 = -4.968574624976280456686e-06 */
+	.quad 0x3EC0ABB806EBDE71 /* P5 = +1.987311456171617620608e-06 */
+	.quad 0xBEA6399CF854F876 /* P6 = -6.623581475862682369330e-07 */
+	.quad 0x3E8964B91728D7C9 /* P7 = +1.891959403186505598965e-07 */
+	.quad 0xBE6961A0528444D6 /* P8 = -4.727645325404986954168e-08 */
+	.quad 0x3E46AE3B0814EE00 /* P9 = +1.056147192151514779549e-08 */
+	.quad 0xBE221B8194DACD16 /* P10 = -2.107984154277957626641e-09 */
+	.quad 0xC019000000000000 /* B = -6.25 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C7BB5622CE1A79E /* PL0 = +2.403331811901679167526e-17 */
+	.quad 0x3FEFFFFA3FF22708 /* PH0 = +9.999972580855862602789e-01 */
+	.quad 0x3ED7003552D53503 /* P1 = +5.483821309338170039906e-06 */
+	.quad 0xBED7003130C1AB92 /* P2 = -5.483806273169366545037e-06 */
+	.quad 0x3ECEAAE13B699C45 /* P3 = +3.655850800133043324271e-06 */
+	.quad 0xBEBEAACB305F3D07 /* P4 = -1.827905351959291114416e-06 */
+	.quad 0x3EA8887F5F9C87EF /* P5 = +7.311461438267648556646e-07 */
+	.quad 0xBE905AD08DF8454F /* P6 = -2.437046884027860662692e-07 */
+	.quad 0x3E72B068300B703F /* P7 = +6.962228483613086736676e-08 */
+	.quad 0xBE52AF921A71C058 /* P8 = -1.740252888706390465423e-08 */
+	.quad 0x3E30B53EAA35300D /* P9 = +3.890131469838137725119e-09 */
+	.quad 0xBE0AB60CDAD7E22E /* P10 = -7.773963050435300060566e-10 */
+	.quad 0xC01B000000000000 /* B = -6.75 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8BD1ACF80D7256 /* PL0 = +4.825835138930451121169e-17 */
+	.quad 0x3FEFFFFDE2760A41 /* PH0 = +9.999989913051835488389e-01 */
+	.quad 0x3EC0EC4F1EC27E55 /* P1 = +2.017388615341105998718e-06 */
+	.quad 0xBEC0EC4E005E6EAC /* P2 = -2.017386580411626200507e-06 */
+	.quad 0x3EB6906504BC4610 /* P3 = +1.344921673533307001969e-06 */
+	.quad 0xBEA6905F0D52C8B5 /* P4 = -6.724581235377781360384e-07 */
+	.quad 0x3E920D0F5CCE152B /* P5 = +2.689810941136721216499e-07 */
+	.quad 0xBE7811505B10E753 /* P6 = -8.965891741619763761543e-08 */
+	.quad 0x3E5B811EE4F9B8EE /* P7 = +2.561544781706659619288e-08 */
+	.quad 0xBE3B80ABC067E840 /* P8 = -6.403452884688571158579e-09 */
+	.quad 0x3E1898E394E09335 /* P9 = +1.431746793613569087489e-09 */
+	.quad 0xBDF3ABB5BA711DB7 /* P10 = -2.862469657501951918569e-10 */
+	.quad 0xC01D000000000000 /* B = -7.25 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8AE01DB39A3791 /* PL0 = +4.662147961093911873193e-17 */
+	.quad 0x3FEFFFFF38C76668 /* PH0 = +9.999996289217962797125e-01 */
+	.quad 0x3EA8E712E56E1188 /* P1 = +7.421562696484951529573e-07 */
+	.quad 0xBEA8E7124A650791 /* P2 = -7.421559942504648535596e-07 */
+	.quad 0x3EA09A0B62D8EF94 /* P3 = +4.947702955735978541097e-07 */
+	.quad 0xBE909A09C56C2107 /* P4 = -2.473847805916120382218e-07 */
+	.quad 0x3E7A900A90A54A6E /* P5 = +9.895362410487317236618e-08 */
+	.quad 0xBE61B5557BB449B6 /* P6 = -3.298434544432568302770e-08 */
+	.quad 0x3E443CC74732CDCA /* P7 = +9.423781066565733462466e-09 */
+	.quad 0xBE243CA8AA8D6E54 /* P8 = -2.355890888986360997159e-09 */
+	.quad 0x3E0219C341E0D1B4 /* P9 = +5.267978308406275552691e-10 */
+	.quad 0xBDDCF49A10950F13 /* P10 = -1.053394074620716018815e-10 */
+	.quad 0xC01F000000000000 /* B = -7.75 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C75CB18F3775414 /* PL0 = +1.890271747518592444083e-17 */
+	.quad 0x3FEFFFFFD38C39F0 /* PH0 = +9.999999172012490333827e-01 */
+	.quad 0x3E8639E2F89493BB /* P1 = +1.655974950855472979393e-07 */
+	.quad 0xBE8639E2D9B29562 /* P2 = -1.655974813708346974914e-07 */
+	.quad 0x3E7DA2836A1F706E /* P3 = +1.103982989742589616541e-07 */
+	.quad 0xBE6DA282C6733DAE /* P4 = -5.519913131581509871840e-08 */
+	.quad 0x3E57B53A278851FD /* P5 = +2.207971980430773309147e-08 */
+	.quad 0xBE3F9C4A72536E22 /* P6 = -7.359895614149337484810e-09 */
+	.quad 0x3E220E81FBE19CDD /* P7 = +2.102073153607135257714e-09 */
+	.quad 0xBE020E8875ADA8D8 /* P8 = -5.255211642212584097407e-10 */
+	.quad 0x3DE07634328384FC /* P9 = +1.197748786062966341989e-10 */
+	.quad 0xBDBA54078E3C351F /* P10 = -2.394539505021488953905e-11 */
+	.quad 0xC021000000000000 /* B = -8.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C98B78738B0EDEF /* PL0 = +8.575399788039081964921e-17 */
+	.quad 0x3FEFFFFFF9FBEA40 /* PH0 = +9.999999887944071019774e-01 */
+	.quad 0x3E581056FAC28C46 /* P1 = +2.241118550516412682327e-08 */
+	.quad 0xBE581056F63A4351 /* P2 = -2.241118525356742542550e-08 */
+	.quad 0x3E500AE49533790A /* P3 = +1.494078933911655875521e-08 */
+	.quad 0xBE400AE489ACBA90 /* P4 = -7.470394349637968945652e-09 */
+	.quad 0x3E29AB0D59A1967B /* P5 = +2.988168557255271725494e-09 */
+	.quad 0xBE111CB32D6EEF2B /* P6 = -9.960558400070350772418e-10 */
+	.quad 0x3DF38CBADF396908 /* P7 = +2.844859618921805216353e-10 */
+	.quad 0xBDD38CC7B92CECD3 /* P8 = -7.112220386749926320915e-11 */
+	.quad 0x3DB1D2BBE2705032 /* P9 = +1.621008722427575444686e-11 */
+	.quad 0xBD8C8199294E6380 /* P10 = -3.240784656869469020111e-12 */
+	.quad 0xC023000000000000 /* B = -9.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8EEEC16618B984 /* PL0 = +5.365957423487855307906e-17 */
+	.quad 0x3FEFFFFFFF2F9279 /* PH0 = +9.999999984834878619111e-01 */
+	.quad 0x3E2A0DB0D052B148 /* P1 = +3.033024167396880687734e-09 */
+	.quad 0xBE2A0DB0CFA6AB71 /* P2 = -3.033024162734192808028e-09 */
+	.quad 0x3E215E75D53A3105 /* P3 = +2.022016035353114070618e-09 */
+	.quad 0xBE115E75D40AA47F /* P4 = -1.011008013562702155050e-09 */
+	.quad 0x3DFBCA5CDC12ED1C /* P5 = +4.044047007631481841556e-10 */
+	.quad 0xBDE286E85704FC22 /* P6 = -1.348015410318274576187e-10 */
+	.quad 0x3DC52A8925354517 /* P7 = +3.850101197145027796396e-11 */
+	.quad 0xBDA52A97EA3F5F4A /* P8 = -9.625355478142550638468e-12 */
+	.quad 0x3D834C011A2AC0F7 /* P9 = +2.193802608697321032841e-12 */
+	.quad 0xBD5EDD05BDCB3A62 /* P10 = -4.385948508419928563300e-13 */
+	.quad 0xC025000000000000 /* B = -10.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C6BD8B474BBF792 /* PL0 = +1.207649585364892639612e-17 */
+	.quad 0x3FEFFFFFFFE3CAD8 /* PH0 = +9.999999997947623953110e-01 */
+	.quad 0x3DFC3527E43C565F /* P1 = +4.104751852963940338559e-10 */
+	.quad 0xBDFC3527E420F415 /* P2 = -4.104751852036136216697e-10 */
+	.quad 0x3DF2CE1A8D806DAD /* P3 = +2.736501142887952919489e-10 */
+	.quad 0xBDE2CE1A8DDF690A /* P4 = -1.368250573053032426141e-10 */
+	.quad 0x3DCE169832D8BD68 /* P5 = +5.473022586854025789680e-11 */
+	.quad 0xBDB40F0FE853DA5B /* P6 = -1.824340550195944358477e-11 */
+	.quad 0x3D96EA8D930D31A1 /* P7 = +5.210545794901128943676e-12 */
+	.quad 0xBD76EA9DB0D09839 /* P8 = -1.302650427355019556441e-12 */
+	.quad 0x3D54E474FD4303A1 /* P9 = +2.968990047962355000258e-13 */
+	.quad 0xBD30B526CA2B228A /* P10 = -5.935740124899435401321e-14 */
+	.quad 0xC027000000000000 /* B = -11.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C56E8953D525FD5 /* PL0 = +4.967494994909661698725e-18 */
+	.quad 0x3FEFFFFFFFFC2EB9 /* PH0 = +9.999999999722241073030e-01 */
+	.quad 0x3DCE8A37A48016C2 /* P1 = +5.555177547354687971427e-11 */
+	.quad 0xBDCE8A37A479B7D4 /* P2 = -5.555177547084873157964e-11 */
+	.quad 0x3DC45C250CFA9C16 /* P3 = +3.703451575129414499553e-11 */
+	.quad 0xBDB45C250D9F8467 /* P4 = -1.851725791056759260154e-11 */
+	.quad 0x3DA049BB33CBD4E9 /* P5 = +7.406930640558963265190e-12 */
+	.quad 0xBD85B7A407C422C1 /* P6 = -2.468976464832073512208e-12 */
+	.quad 0x3D68CF9CED2B3FD5 /* P7 = +7.051706989348171774536e-13 */
+	.quad 0xBD48CFAE64C352B3 /* P8 = -1.762945685274427023683e-13 */
+	.quad 0x3D269EAE08690D52 /* P9 = +4.018091287355461204663e-14 */
+	.quad 0xBD0216CBEAFFF5AA /* P10 = -8.033151495672990022322e-15 */
+	.quad 0xC029000000000000 /* B = -12.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C8ACF1392B106D3 /* PL0 = +4.650601502940921454330e-17 */
+	.quad 0x3FEFFFFFFFFF7BBD /* PH0 = +9.999999999962408958609e-01 */
+	.quad 0x3DA088529889B316 /* P1 = +7.518115268189742464885e-12 */
+	.quad 0xBDA088529887F4C4 /* P2 = -7.518115268005149164680e-12 */
+	.quad 0x3D960B18BF1DF711 /* P3 = +5.012076679213679703380e-12 */
+	.quad 0xBD860B18BFD99A48 /* P4 = -2.506038344573564868987e-12 */
+	.quad 0x3D71A27E7CA64143 /* P5 = +1.002419056539285288454e-12 */
+	.quad 0xBD5783530EA76D91 /* P6 = -3.341396294294381580191e-13 */
+	.quad 0x3D3ADCC75CBD2A03 /* P7 = +9.543447641637910477850e-14 */
+	.quad 0xBD1ADCDA46BE5F17 /* P8 = -2.385887543769010971872e-14 */
+	.quad 0x3CF87D77650BE5B8 /* P9 = +5.437895260471143131391e-15 */
+	.quad 0xBCD395AE6E74C6D2 /* P10 = -1.087168847335561258239e-15 */
+	.quad 0xC02B000000000000 /* B = -13.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C97A8A295292858 /* PL0 = +8.208271151146829171896e-17 */
+	.quad 0x3FEFFFFFFFFFEE19 /* PH0 = +9.999999999994911847878e-01 */
+	.quad 0x3D71E642BB008F95 /* P1 = +1.017466259229268282255e-12 */
+	.quad 0xBD71E642BAFEEC54 /* P2 = -1.017466259207593392022e-12 */
+	.quad 0x3D67DDAE41647741 /* P3 = +6.783108169938233581038e-13 */
+	.quad 0xBD57DDAE4230F34B /* P4 = -3.391554091734942426856e-13 */
+	.quad 0x3D4317C33FAE2536 /* P5 = +1.356626669455791324801e-13 */
+	.quad 0xBD2975040D3E26B9 /* P6 = -4.522088139411435138867e-14 */
+	.quad 0x3D0D155DCD0F0AFB /* P7 = +1.291565189902030307333e-14 */
+	.quad 0xBCED157247832B20 /* P8 = -3.228947666403019234175e-15 */
+	.quad 0x3CCA83D70F607C28 /* P9 = +7.359390959466796619024e-16 */
+	.quad 0xBCA5343952C1E19E /* P10 = -1.471323041436694087188e-16 */
+	.quad 0xC02D000000000000 /* B = -14.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C9B7876CBC5306E /* PL0 = +9.530765996816607711732e-17 */
+	.quad 0x3FEFFFFFFFFFFD93 /* PH0 = +9.999999999999310551502e-01 */
+	.quad 0x3D436121E2640D76 /* P1 = +1.376990843765503869546e-13 */
+	.quad 0xBD436121E26250EA /* P2 = -1.376990843736775811281e-13 */
+	.quad 0x3D39D6D7CA259186 /* P3 = +9.179938654047876451320e-14 */
+	.quad 0xBD29D6D7CB0327CE /* P4 = -4.589969336188563660531e-14 */
+	.quad 0x3D14ABE4DC31244A /* P5 = +1.835994545584345768382e-14 */
+	.quad 0xBCFB8FDB82AB6BB7 /* P6 = -6.119980791767901275443e-15 */
+	.quad 0x3CDF7CF757491B60 /* P7 = +1.747943407988343076526e-15 */
+	.quad 0xBCBF7D0D833640FB /* P8 = -4.369905470133249448357e-16 */
+	.quad 0x3C9CB512F6BDC754 /* P9 = +9.959852600692493655511e-17 */
+	.quad 0xBC76F50AB1B0E9BA /* P10 = -1.991219205936492089091e-17 */
+	.quad 0xC02F000000000000 /* B = -15.5 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C6FFE15D5F78543 /* PL0 = +1.387454417328248962819e-17 */
+	.quad 0x3FEFFFFFFFFFFFE1 /* PH0 = +9.999999999999965583086e-01 */
+	.quad 0x3CFEE00288B99C26 /* P1 = +6.855635762864742358597e-15 */
+	.quad 0xBCFEE0027D060EE2 /* P2 = -6.855635607998342735403e-15 */
+	.quad 0x3CF4954AA23148A2 /* P3 = +4.570381865813341696777e-15 */
+	.quad 0xBCE4954B5DAD3010 /* P4 = -2.285192173571711474199e-15 */
+	.quad 0x3CD07883DD8793BD /* P5 = +9.143109661358222028007e-16 */
+	.quad 0xBCB5F5F4BB87ADCF /* P6 = -3.047668447080103869032e-16 */
+	.quad 0x3C98F1A905097685 /* P7 = +8.654183371862458774513e-17 */
+	.quad 0xBC78F2D585007222 /* P8 = -2.163943551222030413627e-17 */
+	.quad 0x3C58A37CC5082B5F /* P9 = +5.342649626494471588064e-18 */
+	.quad 0xBC33AE7917F94D17 /* P10 = -1.066938163384541013918e-18 */
+	.quad 0xC031000000000000 /* B = -17 */
+	.quad 0x3FF0000000000000 /* A = +1 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x0000000000000000 /* Align value = +0 */
+	.quad 0x3C91BF1D80474F0F /* PL0 = +6.157069264461989135096e-17 */
+	.quad 0x3FEFFFFFFFFFFFFE /* PH0 = +9.999999999999997779554e-01 */
+	.quad 0x3CB72071400E6275 /* P1 = +3.209478247225075961360e-16 */
+	.quad 0xBCB72071400A9F37 /* P2 = -3.209478247103497434502e-16 */
+	.quad 0x3CAED5EC39A77629 /* P3 = +2.139652050028423711308e-16 */
+	.quad 0xBC9ED5EC3B530600 /* P4 = -1.069826028468029104719e-16 */
+	.quad 0x3C88AB2BFED159DE /* P5 = +4.279326904335078988705e-17 */
+	.quad 0xBC70721D1220B3FC /* P6 = -1.426441958074916244382e-17 */
+	.quad 0x3C52C96049721FB8 /* P7 = +4.073700029965821523731e-18 */
+	.quad 0xBC32C971215735DC /* P8 = -1.018438939975201710113e-18 */
+	.quad 0x3C112EF658AB41A9 /* P9 = +2.328791246104218830028e-19 */
+	.quad 0xBBEB7B598C6AD3DE /* P10 = 
-4.655603964908654142787e-20 */ + .quad 0xC03287E0C98F84E5 /* B = -18.530774 */ + .quad 0x3FF0000000000000 /* A = +1 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* PL0 = +0.000000000000000000000e-01 */ + .quad 0x3FF0000000000000 /* PH0 = +1.000000000000000000000e+00 */ + .quad 0x0000000000000000 /* P1 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P2 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P3 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P4 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P5 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P6 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P7 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P8 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P9 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* P10 = +0.000000000000000000000e-01 */ + .quad 0x0000000000000000 /* B = +0 */ + .quad 0x0000000000000000 /* A = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .quad 0x0000000000000000 /* Align value = +0 */ + .align 32 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 /* _dbSignMask */ + .align 32 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* _dbAbsMask */ + .align 32 + .long 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000 /* _iExpMantMask */ + .align 32 + .long 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000 /* _iExpMask */ + .align 32 + .long 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000 /* _iMinIdxOfsMask */ + .align 32 + .long 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000 /* _iMaxIdxMask */ + .align 32 + .type 
__svml_dtanh_data_internal,@object + .size __svml_dtanh_data_internal,.-__svml_dtanh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core-avx2.S new file mode 100644 index 0000000000..92fb24a640 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized tanh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVeN8v_tanh _ZGVeN8v_tanh_avx2_wrapper +#include "../svml_d_tanh8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core.c new file mode 100644 index 0000000000..495cb1f4fc --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized tanh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version.
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define SYMBOL_NAME _ZGVeN8v_tanh +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_tanh, __GI__ZGVeN8v_tanh, __redirect__ZGVeN8v_tanh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core_avx512.S new file mode 100644 index 0000000000..01fc22ba6f --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_tanh8_core_avx512.S @@ -0,0 +1,472 @@ +/* Function tanh vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * NOTE: Since the hyperbolic tangent function is odd + * (tanh(x) = -tanh(-x)), below algorithm deals with the absolute + * value of the argument |x|: tanh(x) = sign(x) * tanh(|x|) + * + * We use a table lookup method to compute tanh(|x|). + * The basic idea is to split the input range into a number of subintervals + * and to approximate tanh(.) with a polynomial on each of them. + * + * IEEE SPECIAL CONDITIONS: + * x = [+,-]0, r = [+,-]0 + * x = +Inf, r = +1 + * x = -Inf, r = -1 + * x = QNaN, r = QNaN + * x = SNaN, r = QNaN + * + * + * ALGORITHM DETAILS + * We handle special values in a callout function, aside from main path + * computations. "Special" for this algorithm are: + * INF, NAN, |x| > HUGE_THRESHOLD + * + * + * Main path computations are organized as follows: + * Actually we split the interval [0, SATURATION_THRESHOLD) + * into a number of subintervals. On each subinterval we approximate tanh(.) + * with a minimax polynomial of pre-defined degree. Polynomial coefficients + * are computed beforehand and stored in table. We also use + * + * y := |x| + B, + * + * here B depends on subinterval and is used to make argument + * closer to zero. + * We also add large fake interval [SATURATION_THRESHOLD, HUGE_THRESHOLD], + * where 1.0 + 0.0*y + 0.0*y^2 ... coefficients are stored - just to + * preserve main path computation logic but return 1.0 for all arguments. + * + * Hence reconstruction looks as follows: + * we extract proper polynomial and range reduction coefficients + * (Pj and B), corresponding to subinterval, to which |x| belongs, + * and return + * + * r := sign(x) * (P0 + P1 * y + ... + Pn * y^n) + * + * NOTE: we use multiprecision technique to multiply and sum the first + * K terms of the polynomial. So Pj, j = 0..K are stored in + * table each as a pair of target precision numbers (Pj and PLj) to + * achieve wider than target precision. 
+ * + * + */ + +/* Offsets for data table __svml_dtanh_data_internal + */ +#define _dC 0 +#define _dP0 128 +#define _dP1 256 +#define _dP2 384 +#define _dP3 512 +#define _dP4 640 +#define _dP5 768 +#define _dP6 896 +#define _dP7 1024 +#define _dP8 1152 +#define _dP9 1280 +#define _dP10 1408 +#define _dP11 1536 +#define _dP12 1664 +#define _dP13 1792 +#define _dP14 1920 +#define _dP15 2048 +#define _dP16 2176 +#define _dP17 2304 +#define _iExpMantMask_UISA 2432 +#define _iMinIdxOfsMask_UISA 2496 +#define _iMaxIdxMask_UISA 2560 +#define _dbSignMask 2624 +#define _dbAbsMask 2688 +#define _iExpMantMask 2752 +#define _iExpMask 2816 +#define _iMinIdxOfsMask 2880 +#define _iMaxIdxMask 2944 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_tanh_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $320, %rsp + vpsrlq $32, %zmm0, %zmm4 + vmovups %zmm0, (%rsp) + vmovups __svml_dtanh_data_internal(%rip), %zmm14 + vmovups _dP0+__svml_dtanh_data_internal(%rip), %zmm15 + vpmovqd %zmm4, %ymm5 + +/* Constant loading */ + vandpd _dbAbsMask+__svml_dtanh_data_internal(%rip), %zmm0, %zmm13 + vandpd _dbSignMask+__svml_dtanh_data_internal(%rip), %zmm0, %zmm3 + +/* Here huge arguments, INF and NaNs are filtered out to callout.
*/ + vpand _iExpMantMask_UISA+__svml_dtanh_data_internal(%rip), %ymm5, %ymm7 + vmovups _dP2+__svml_dtanh_data_internal(%rip), %zmm0 + vmovups _dP16+__svml_dtanh_data_internal(%rip), %zmm4 + vmovups _dP15+__svml_dtanh_data_internal(%rip), %zmm5 + vmovups %zmm3, 64(%rsp) + vmovups _dP3+__svml_dtanh_data_internal(%rip), %zmm3 + vpsubd _iMinIdxOfsMask_UISA+__svml_dtanh_data_internal(%rip), %ymm7, %ymm8 + +/* if VMIN, VMAX is defined for I type */ + vxorps %ymm9, %ymm9, %ymm9 + vpmaxsd %ymm9, %ymm8, %ymm10 + vpminsd _iMaxIdxMask_UISA+__svml_dtanh_data_internal(%rip), %ymm10, %ymm11 + vpsrld $19, %ymm11, %ymm12 + vmovups _dP12+__svml_dtanh_data_internal(%rip), %zmm8 + vmovups _dP11+__svml_dtanh_data_internal(%rip), %zmm9 + vmovups _dP10+__svml_dtanh_data_internal(%rip), %zmm10 + vmovups _dP9+__svml_dtanh_data_internal(%rip), %zmm11 + vpmovzxdq %ymm12, %zmm2 + vmovups _dP8+__svml_dtanh_data_internal(%rip), %zmm12 + vpermt2pd _dP2+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm0 + vpermt2pd _dC+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm14 + vpermt2pd _dP16+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm4 + vpermt2pd _dP15+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm5 + vsubpd {rn-sae}, %zmm14, %zmm13, %zmm1 + vpermt2pd _dP12+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm8 + vpermt2pd _dP11+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm9 + vpermt2pd _dP10+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm10 + vpermt2pd _dP9+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm11 + vpermt2pd _dP8+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm12 + vpermt2pd _dP3+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm3 + vpermt2pd _dP0+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm15 + vmovups %zmm0, 192(%rsp) + vmovups _dP17+__svml_dtanh_data_internal(%rip), %zmm0 + vmovups _dP7+__svml_dtanh_data_internal(%rip), %zmm13 + vmovups _dP6+__svml_dtanh_data_internal(%rip), %zmm14 + vmovups %zmm3, 256(%rsp) + vmovups _dP5+__svml_dtanh_data_internal(%rip), %zmm3 + vmovups 
%zmm15, 128(%rsp) + vmovups _dP4+__svml_dtanh_data_internal(%rip), %zmm15 + vpermt2pd _dP17+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm0 + vpermt2pd _dP7+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm13 + vpermt2pd _dP6+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm14 + vpermt2pd _dP5+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm3 + vpermt2pd _dP4+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm15 + vfmadd213pd {rn-sae}, %zmm4, %zmm1, %zmm0 + vpcmpgtd _iExpMask+__svml_dtanh_data_internal(%rip), %ymm7, %ymm6 + vmovmskps %ymm6, %edx + vmovups _dP14+__svml_dtanh_data_internal(%rip), %zmm6 + vfmadd213pd {rn-sae}, %zmm5, %zmm1, %zmm0 + vmovups _dP13+__svml_dtanh_data_internal(%rip), %zmm7 + vpermt2pd _dP14+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm6 + vpermt2pd _dP13+64+__svml_dtanh_data_internal(%rip), %zmm2, %zmm7 + vfmadd213pd {rn-sae}, %zmm6, %zmm1, %zmm0 + vmovups 256(%rsp), %zmm2 + vfmadd213pd {rn-sae}, %zmm7, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm8, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm9, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm10, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm11, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm12, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm13, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm14, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm3, %zmm1, %zmm0 + vmovups 128(%rsp), %zmm3 + vfmadd213pd {rn-sae}, %zmm15, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm2, %zmm1, %zmm0 + vmovups 192(%rsp), %zmm2 + vfmadd213pd {rn-sae}, %zmm2, %zmm1, %zmm0 + vfmadd213pd {rn-sae}, %zmm3, %zmm1, %zmm0 + vorpd 64(%rsp), %zmm0, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups (%rsp), %zmm1 + 
vmovups %zmm0, 128(%rsp) + vmovups %zmm1, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -304; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xfe, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -312; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xfe, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -320; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xfe, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -304; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xfe, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -312; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 
0xc8, 0xfe, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -320; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xfe, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call tanh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_tanh_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dtanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _dC[16][2]; + __declspec(align(64)) VUINT32 _dP0[16][2]; + __declspec(align(64)) VUINT32 _dP1[16][2]; + __declspec(align(64)) VUINT32 _dP2[16][2]; + __declspec(align(64)) VUINT32 _dP3[16][2]; + __declspec(align(64)) VUINT32 _dP4[16][2]; + __declspec(align(64)) VUINT32 _dP5[16][2]; + __declspec(align(64)) VUINT32 _dP6[16][2]; + __declspec(align(64)) VUINT32 _dP7[16][2]; + __declspec(align(64)) VUINT32 _dP8[16][2]; + __declspec(align(64)) VUINT32 _dP9[16][2]; + __declspec(align(64)) VUINT32 _dP10[16][2]; + __declspec(align(64)) VUINT32 _dP11[16][2]; + __declspec(align(64)) VUINT32 _dP12[16][2]; + __declspec(align(64)) VUINT32 _dP13[16][2]; + __declspec(align(64)) VUINT32 _dP14[16][2]; + __declspec(align(64)) VUINT32 _dP15[16][2]; + __declspec(align(64)) VUINT32 _dP16[16][2]; + __declspec(align(64)) VUINT32 _dP17[16][2]; + __declspec(align(64)) VUINT32 _iExpMantMask_UISA[16][1]; + __declspec(align(64)) VUINT32 _iMinIdxOfsMask_UISA[16][1]; + __declspec(align(64)) VUINT32 _iMaxIdxMask_UISA[16][1]; + __declspec(align(64)) VUINT32 _dbSignMask[8][2]; + __declspec(align(64)) VUINT32 _dbAbsMask[8][2]; + __declspec(align(64)) VUINT32 _iExpMantMask[16][1]; +
__declspec(align(64)) VUINT32 _iExpMask[16][1]; + __declspec(align(64)) VUINT32 _iMinIdxOfsMask[16][1]; + __declspec(align(64)) VUINT32 _iMaxIdxMask[16][1]; +} __svml_dtanh_data_internal; +#endif +__svml_dtanh_data_internal: + /*== _dC ==*/ + .quad 0x0000000000000000, 0x3fcc000000000000, 0x3fd4000000000000, 0x3fdc000000000000 + .quad 0x3fe4000000000000, 0x3fec000000000000, 0x3ff4000000000000, 0x3ffc000000000000 + .quad 0x4004000000000000, 0x400c000000000000, 0x4014000000000000, 0x401c000000000000 + .quad 0x4024000000000000, 0x402c000000000000, 0x4034000000000000, 0x0000000000000000 + /*== p0 ==*/ + .align 64 + .quad 0x0000000000000000, 0x3fcb8fd0416a7c92, 0x3fd35f98a0ea650e, 0x3fda5729ee488037 + .quad 0x3fe1bf47eabb8f95, 0x3fe686650b8c2015, 0x3feb2523bb6b2dee, 0x3fee1fbf97e33527 + .quad 0x3fef9258260a71c2, 0x3feff112c63a9077, 0x3fefff419668df11, 0x3feffffc832750f2 + .quad 0x3feffffffdc96f35, 0x3fefffffffffcf58, 0x3ff0000000000000, 0x3ff0000000000000 + /*== p1 ==*/ + .align 64 + .quad 0x0000000000000000, 0x3c65e23ebcd3bcbe, 0xbc4c600bac3adf00, 0x3c6c44091785d040 + .quad 0x3c8221d7a6e3674b, 0x3c69f89d2cf6b85c, 0x3c73b3e9ec0b8f1c, 0xbc7f8d4b0428aada + .quad 0xbc7c52d880cf43c0, 0x3c7dd36e37096480, 0x3c7b4f6380c442ca, 0xbc729755de470096 + .quad 0x3c84cf852845efbd, 0x3c6fc4fb440a5378, 0xbc63981083b55870, 0x0000000000000000 + /*== p2 ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3fee842ca3f08532, 0x3fed11574af58f1b, 0x3fea945b9c24e4f9 + .quad 0x3fe6284c3374f815, 0x3fe02500a09f8d6e, 0x3fd1f25131e3a8c0, 0x3fbd22ca1c24a139 + .quad 0x3f9b3afe1fba5c76, 0x3f6dd37d19b22b21, 0x3f27ccec13a9ef96, 0x3ecbe6c3f33250ae + .quad 0x3e41b4865394f75f, 0x3d8853f01bda5f28, 0x3c73953c0197ef58, 0x0000000000000000 + /*== p3 ==*/ + .align 64 + .quad 0xbbf0b3ea3fdfaa19, 0xbfca48aaeb53bc21, 0xbfd19921f4329916, 0xbfd5e0f09bef8011 + .quad 0xbfd893b59c35c882, 0xbfd6ba7cb7576538, 0xbfce7291743d7555, 0xbfbb6d85a01efb80 + .quad 0xbf9addae58c7141a, 0xbf6dc59376c7aa19, 0xbf27cc5e74677410, 
0xbecbe6c0e8b4cc87 + .quad 0xbe41b486526b0565, 0xbd8853f01bef63a4, 0xbc73955be519be31, 0x0000000000000000 + /*== p4 ==*/ + .align 64 + .quad 0xbfd5555555555555, 0xbfd183afc292ba11, 0xbfcc1a4b039c9bfa, 0xbfc16e1e6d8d0be6 + .quad 0xbf92426c751e48a2, 0x3fb4f152b2bad124, 0x3fbbba40cbef72be, 0x3fb01ba038be6a3d + .quad 0x3f916df44871efc8, 0x3f63c6869dfc8870, 0x3f1fb9aef915d828, 0x3ec299d1e27c6e11 + .quad 0x3e379b5ddcca334c, 0x3d8037f57bc62c9a, 0x3c6a2d4b50a2cff7, 0x0000000000000000 + /*== p5 ==*/ + .align 64 + .quad 0xbce6863ee44ed636, 0x3fc04dcd0476c75e, 0x3fc43d3449a80f08, 0x3fc5c26f3699b7e7 + .quad 0x3fc1a686f6ab2533, 0x3faf203c316ce730, 0xbf89c7a02788557c, 0xbf98157e26e0d541 + .quad 0xbf807b55c1c7d278, 0xbf53a18d5843190f, 0xbf0fb6bbc89b1a5b, 0xbeb299c9c684a963 + .quad 0xbe279b5dd4fb3d01, 0xbd7037f57ae72aa6, 0xbc5a2ca2bba78e86, 0x0000000000000000 + /*== p6 ==*/ + .align 64 + .quad 0x3fc1111111112ab5, 0x3fb5c19efdfc08ad, 0x3fa74c98dc34fbac, 0xbf790d6a8eff0a77 + .quad 0xbfac3c021789a786, 0xbfae2196b7326859, 0xbf93a7a011ff8c2a, 0x3f6e4709c7e8430e + .quad 0x3f67682afa611151, 0x3f3ef2ee77717cbf, 0x3ef95a4482f180b7, 0x3e9dc2c27da3b603 + .quad 0x3e12e2afd9f7433e, 0x3d59f320348679ba, 0x3c44b61d9bbcc940, 0x0000000000000000 + /*== p7 ==*/ + .align 64 + .quad 0xbda1ea19ddddb3b4, 0xbfb0b8df995ce4df, 0xbfb2955cf41e8164, 0xbfaf9d05c309f7c6 + .quad 0xbf987d27ccff4291, 0x3f8b2ca62572b098, 0x3f8f1cf6c7f5b00a, 0x3f60379811e43dd5 + .quad 0xbf4793826f78537e, 0xbf2405695e36240f, 0xbee0e08de39ce756, 0xbe83d709ba5f714e + .quad 0xbdf92e3fc5ee63e0, 0xbd414cc030f2110e, 0xbc2ba022e8d82a87, 0x0000000000000000 + /*== p8 ==*/ + .align 64 + .quad 0xbfaba1ba1990520b, 0xbf96e37bba52f6fc, 0x3ecff7df18455399, 0x3f97362834d33a4e + .quad 0x3f9e7f8380184b45, 0x3f869543e7c420d4, 0xbf7326bd4914222a, 0xbf5fc15b0a9d98fa + .quad 0x3f14cffcfa69fbb6, 0x3f057e48e5b79d10, 0x3ec33b66d7d77264, 0x3e66ac4e578b9b10 + .quad 0x3ddcc74b8d3d5c42, 0x3d23c589137f92b4, 0x3c107f8e2c8707a1, 0x0000000000000000 + /*== p9 ==*/ + 
.align 64 + .quad 0xbe351ca7f096011f, 0x3f9eaaf3320c3851, 0x3f9cf823fe761fc1, 0x3f9022271754ff1f + .quad 0xbf731fe77c9c60af, 0xbf84a6046865ec7d, 0xbf4ca3f1f2b9192b, 0x3f4c77dee0afd227 + .quad 0x3f04055bce68597a, 0xbee2bf0cb4a71647, 0xbea31eaafe73efd5, 0xbe46abb02c4368ed + .quad 0xbdbcc749ca8079dd, 0xbd03c5883836b9d2, 0xbbf07a5416264aec, 0x0000000000000000 + /*== p10 ==*/ + .align 64 + .quad 0x3f9664f94e6ac14e, 0xbf94d3343bae39dd, 0xbf7bc748e60df843, 0xbf8c89372b43ba85 + .quad 0xbf8129a092de747a, 0x3f60c85b4d538746, 0x3f5be9392199ec18, 0xbf2a0c68a4489f10 + .quad 0xbf00462601dc2faa, 0x3eb7b6a219dea9f4, 0x3e80cbcc8d4c5c8a, 0x3e2425bb231a5e29 + .quad 0x3d9992a4beac8662, 0x3ce191ba5ed3fb67, 0x3bc892450bad44c4, 0x0000000000000000 + /*== p11 ==*/ + .align 64 + .quad 0xbea8c4c1fd7852fe, 0xbfccce16b1046f13, 0xbf81a16f224bb7b6, 0xbf62cbf00406bc09 + .quad 0x3f75b29bb02cf69b, 0x3f607df0f9f90c17, 0xbf4b852a6e0758d5, 0xbf0078c63d1b8445 + .quad 0x3eec12eadd55be7a, 0xbe6fa600f593181b, 0xbe5a3c935dce3f7d, 0xbe001c6d95e3ae96 + .quad 0xbd74755a00ea1fd3, 0xbcbc1c6c063bb7ac, 0xbba3be9a4460fe00, 0x0000000000000000 + /*== p12 ==*/ + .align 64 + .quad 0xbf822404577aa9dd, 0x403d8b07f7a82aa3, 0xbf9f44ab92fbab0a, 0x3fb2eac604473d6a + .quad 0x3f45f87d903aaac8, 0xbf5e104671036300, 0x3f19bc98ddf0f340, 0x3f0d4304bc9246e8 + .quad 0xbed13c415f7b9d41, 0xbe722b8d9720cdb0, 0x3e322666d739bec0, 0x3dd76a553d7e7918 + .quad 0x3d4de0fa59416a39, 0x3c948716cf3681b4, 0x3b873f9f2d2fda99, 0x0000000000000000 + /*== p13 ==*/ + .align 64 + .quad 0xbefdd99a221ed573, 0x4070593a3735bab4, 0xbfccab654e44835e, 0x3fd13ed80037dbac + .quad 0xbf6045b9076cc487, 0x3f2085ee7e8ac170, 0x3f23524622610430, 0xbeff12a6626911b4 + .quad 0x3eab9008bca408af, 0x3e634df71865f620, 0xbe05bb1bcf83ca73, 0xbdaf2ac143fb6762 + .quad 0xbd23eae52a3dbf57, 0xbc6b5e3e9ca0955e, 0xbb5eca68e2c1ba2e, 0x0000000000000000 + /*== p14 ==*/ + .align 64 + .quad 0x3f6e3be689423841, 0xc0d263511f5baac1, 0x40169f73b15ebe5c, 0xc025c1dd41cd6cb5 + .quad 
0xbf58fd89fe05e0d1, 0x3f73f7af01d5af7a, 0xbf1e40bdead17e6b, 0x3ee224cd6c4513e5 + .quad 0xbe24b645e68eeaa3, 0xbe4abfebfb72bc83, 0x3dd51c38f8695ed3, 0x3d8313ac38c6832b + .quad 0x3cf7787935626685, 0x3c401ffc49c6bc29, 0xbabf0b21acfa52ab, 0x0000000000000000 + /*== p15 ==*/ + .align 64 + .quad 0xbf2a1306713a4f3a, 0xc1045e509116b066, 0x4041fab9250984ce, 0xc0458d090ec3de95 + .quad 0xbf74949d60113d63, 0x3f7c9fd6200d0ade, 0x3f02cd40e0ad0a9f, 0xbe858ab8e019f311 + .quad 0xbe792fa6323b7cf8, 0x3e2df04d67876402, 0xbd95c72be95e4d2c, 0xbd55a89c30203106 + .quad 0xbccad6b3bb9eff65, 0xbc12705ccd3dd884, 0xba8e0a4c47ae75f5, 0x0000000000000000 + /*== p16 ==*/ + .align 64 + .quad 0xbf55d7e76dc56871, 0x41528c38809c90c7, 0xc076d57fb5190b02, 0x4085f09f888f8ada + .quad 0x3fa246332a2fcba5, 0xbfb29d851a896fcd, 0x3ed9065ae369b212, 0xbeb8e1ba4c98a030 + .quad 0x3e6ffd0766ad4016, 0xbe0c63c29f505f5b, 0xbd7fab216b9e0e49, 0x3d2826b62056aa27 + .quad 0x3ca313e31762f523, 0x3bea37aa21895319, 0x3ae5c7f1fd871496, 0x0000000000000000 + /*== p17 ==*/ + .align 64 + .quad 0x3f35e67ab76a26e7, 0x41848ee0627d8206, 0xc0a216d618b489ec, 0x40a5b89107c8af4f + .quad 0x3fb69d8374520eda, 0xbfbded519f981716, 0xbef02d288b5b3371, 0x3eb290981209c1a6 + .quad 0xbe567e924bf5ff6e, 0x3de3f7f7de6b0eb6, 0x3d69ed18bae3ebbc, 0xbcf7534c4f3dfa71 + .quad 0xbc730b73f1eaff20, 0xbbba2cff8135d462, 0xbab5a71b5f7d9035, 0x0000000000000000 + .align 64 + .long 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000 /* _iExpMantMask_UISA */ + .align 64 + .long 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000, 0x3fc00000 /* _iMinIdxOfsMask_UISA */ + .align 64 + .long 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 
0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000, 0x00780000 /* _iMaxIdxMask_UISA */ + .align 64 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 /* _dbSignMask */ + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff /* _dbAbsMask */ + .align 64 + .long 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000, 0x7ffe0000 /* _iExpMantMask */ + .align 64 + .long 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000 /* _iExpMask */ + .align 64 + .long 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000, 0x3fbe0000 /* _iMinIdxOfsMask */ + .align 64 + .long 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000, 0x00760000 /* _iMaxIdxMask */ + .align 64 + .type __svml_dtanh_data_internal,@object + .size __svml_dtanh_data_internal,.-__svml_dtanh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core-avx2.S new file mode 100644 index 0000000000..76bb22229e --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized tanhf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +#define _ZGVeN16v_tanhf _ZGVeN16v_tanhf_avx2_wrapper +#include "../svml_s_tanhf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core.c new file mode 100644 index 0000000000..cec4c7ed74 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized tanhf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/.
*/ + +#define SYMBOL_NAME _ZGVeN16v_tanhf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_tanhf, __GI__ZGVeN16v_tanhf, + __redirect__ZGVeN16v_tanhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core_avx512.S new file mode 100644 index 0000000000..b6bdf97cc5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf16_core_avx512.S @@ -0,0 +1,381 @@ +/* Function tanhf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * NOTE: Since the hyperbolic tangent function is odd + * (tanh(x) = -tanh(-x)), below algorithm deals with the absolute + * value of the argument |x|: tanh(x) = sign(x) * tanh(|x|) + * + * We use a table lookup method to compute tanh(|x|). + * The basic idea is to split the input range into a number of subintervals + * and to approximate tanh(.) with a polynomial on each of them. 
+ * + * IEEE SPECIAL CONDITIONS: + * x = [+,-]0, r = [+,-]0 + * x = +Inf, r = +1 + * x = -Inf, r = -1 + * x = QNaN, r = QNaN + * x = SNaN, r = QNaN + * + * + * ALGORITHM DETAILS + * We handle special values in a callout function, aside from main path + * computations. "Special" for this algorithm are: + * INF, NAN, |x| > HUGE_THRESHOLD + * + * + * Main path computations are organized as follows: + * Actually we split the interval [0, SATURATION_THRESHOLD) + * into a number of subintervals. On each subinterval we approximate tanh(.) + * with a minimax polynomial of pre-defined degree. Polynomial coefficients + * are computed beforehand and stored in table. We also use + * + * y := |x| + B, + * + * here B depends on subinterval and is used to make argument + * closer to zero. + * We also add large fake interval [SATURATION_THRESHOLD, HUGE_THRESHOLD], + * where 1.0 + 0.0*y + 0.0*y^2 ... coefficients are stored - just to + * preserve main path computation logic but return 1.0 for all arguments. + * + * Hence reconstruction looks as follows: + * we extract proper polynomial and range reduction coefficients + * (Pj and B), corresponding to subinterval, to which |x| belongs, + * and return + * + * r := sign(x) * (P0 + P1 * y + ... + Pn * y^n) + * + * NOTE: we use multiprecision technique to multiply and sum the first + * K terms of the polynomial. So Pj, j = 0..K are stored in + * table each as a pair of target precision numbers (Pj and PLj) to + * achieve wider than target precision. 
+ * + * + */ + +/* Offsets for data table __svml_stanh_data_internal + */ +#define _sC 0 +#define _sP0 128 +#define _sP2 256 +#define _sP3 384 +#define _sP4 512 +#define _sP5 640 +#define _sP6 768 +#define _sP7 896 +#define _iExpMantMask_UISA 1024 +#define _iMinIdxOfsMask_UISA 1088 +#define _iMaxIdxMask_UISA 1152 +#define _sSignMask 1216 +#define _sAbsMask 1280 +#define _iExpMantMask 1344 +#define _iExpMask 1408 +#define _iMinIdxOfsMask 1472 +#define _iMaxIdxMask 1536 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_tanhf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovaps %zmm0, %zmm1 + vmovups __svml_stanh_data_internal(%rip), %zmm9 + vmovups _sP6+__svml_stanh_data_internal(%rip), %zmm11 + vmovups _sP5+__svml_stanh_data_internal(%rip), %zmm12 + vmovups _sP4+__svml_stanh_data_internal(%rip), %zmm13 + vmovups _sP3+__svml_stanh_data_internal(%rip), %zmm14 + vmovups _sP2+__svml_stanh_data_internal(%rip), %zmm15 + vpternlogd $255, %zmm2, %zmm2, %zmm2 + vandps _sAbsMask+__svml_stanh_data_internal(%rip), %zmm1, %zmm8 + vandps _sSignMask+__svml_stanh_data_internal(%rip), %zmm1, %zmm0 + +/* Here huge arguments, INF and NaNs are filtered out to callout.
*/ + vpandd _iExpMantMask_UISA+__svml_stanh_data_internal(%rip), %zmm1, %zmm3 + vpsubd _iMinIdxOfsMask_UISA+__svml_stanh_data_internal(%rip), %zmm3, %zmm4 + vpcmpd $2, _iExpMask+__svml_stanh_data_internal(%rip), %zmm3, %k1 + +/* + * small table specific variables * + * Constant loading + */ + vpxord %zmm5, %zmm5, %zmm5 + +/* if VMIN, VMAX is defined for I type */ + vpmaxsd %zmm5, %zmm4, %zmm6 + vpminsd _iMaxIdxMask_UISA+__svml_stanh_data_internal(%rip), %zmm6, %zmm7 + vpsrld $21, %zmm7, %zmm10 + vmovups _sP7+__svml_stanh_data_internal(%rip), %zmm4 + vpermt2ps _sC+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm9 + vpermt2ps _sP6+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm11 + vpermt2ps _sP7+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm4 + vpermt2ps _sP5+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm12 + vpermt2ps _sP4+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm13 + vpermt2ps _sP3+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm14 + vpermt2ps _sP2+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm15 + vpandnd %zmm3, %zmm3, %zmm2{%k1} + vptestmd %zmm2, %zmm2, %k0 + vmovups _sP0+__svml_stanh_data_internal(%rip), %zmm3 + vsubps {rn-sae}, %zmm9, %zmm8, %zmm2 + kmovw %k0, %edx + vfmadd213ps {rn-sae}, %zmm11, %zmm2, %zmm4 + vpermt2ps _sP0+64+__svml_stanh_data_internal(%rip), %zmm10, %zmm3 + vfmadd213ps {rn-sae}, %zmm12, %zmm2, %zmm4 + vfmadd213ps {rn-sae}, %zmm13, %zmm2, %zmm4 + vfmadd213ps {rn-sae}, %zmm14, %zmm2, %zmm4 + vfmadd213ps {rn-sae}, %zmm15, %zmm2, %zmm4 + vfmadd213ps {rn-sae}, %zmm3, %zmm2, %zmm4 + vorps %zmm0, %zmm4, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm1 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm1, 64(%rsp) 
+ vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 
0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call tanhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_tanhf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_stanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(64)) VUINT32 _sC[32][1]; + __declspec(align(64)) VUINT32 _sP0[32][1]; + __declspec(align(64)) VUINT32 _sP2[32][1]; + __declspec(align(64)) VUINT32 _sP3[32][1]; + __declspec(align(64)) VUINT32 _sP4[32][1]; + __declspec(align(64)) VUINT32 _sP5[32][1]; + __declspec(align(64)) VUINT32 _sP6[32][1]; + __declspec(align(64)) VUINT32 _sP7[32][1]; + __declspec(align(64)) VUINT32 _iExpMantMask_UISA[16][1]; + __declspec(align(64)) VUINT32 _iMinIdxOfsMask_UISA[16][1]; + __declspec(align(64)) VUINT32 _iMaxIdxMask_UISA[16][1]; + __declspec(align(64)) VUINT32 _sSignMask[16][1]; + __declspec(align(64)) VUINT32 _sAbsMask[16][1]; + __declspec(align(64)) VUINT32 _iExpMantMask[16][1]; + __declspec(align(64)) VUINT32 _iExpMask[16][1]; + __declspec(align(64)) VUINT32 _iMinIdxOfsMask[16][1]; + __declspec(align(64)) VUINT32 _iMaxIdxMask[16][1]; +} __svml_stanh_data_internal; +#endif +__svml_stanh_data_internal: + /*== _sC ==*/ + .long 0x00000000, 0x3d700000, 0x3d900000, 0x3db00000 + .long 0x3dd00000, 0x3df00000, 0x3e100000, 0x3e300000 + .long 0x3e500000, 0x3e700000, 0x3e900000, 0x3eb00000 + .long 0x3ed00000, 0x3ef00000, 0x3f100000, 0x3f300000 + .long 0x3f500000, 0x3f700000, 0x3f900000, 0x3fb00000 + .long 0x3fd00000,
0x3ff00000, 0x40100000, 0x40300000 + .long 0x40500000, 0x40700000, 0x40900000, 0x40b00000 + .long 0x40d00000, 0x40f00000, 0x41100000, 0x00000000 + /*== p0 ==*/ + .align 64 + .long 0x00000000, 0x3d6fb9c9, 0x3d8fc35f, 0x3daf9169 + .long 0x3dcf49ab, 0x3deee849, 0x3e0f0ee8, 0x3e2e4984 + .long 0x3e4d2f8e, 0x3e6bb32e, 0x3e8c51cd, 0x3ea96163 + .long 0x3ec543f1, 0x3edfd735, 0x3f028438, 0x3f18abf0 + .long 0x3f2bc480, 0x3f3bec1c, 0x3f4f2e5b, 0x3f613c53 + .long 0x3f6ce37d, 0x3f743c4f, 0x3f7a5feb, 0x3f7dea85 + .long 0x3f7f3b3d, 0x3f7fb78c, 0x3f7fefd4, 0x3f7ffdd0 + .long 0x3f7fffb4, 0x3f7ffff6, 0x3f7fffff, 0x3f800000 + /*== p2 ==*/ + .align 64 + .long 0x3f800000, 0x3f7f1f84, 0x3f7ebd11, 0x3f7e1e5f + .long 0x3f7d609f, 0x3f7c842d, 0x3f7b00e5, 0x3f789580 + .long 0x3f75b8ad, 0x3f726fd9, 0x3f6cc59b, 0x3f63fb92 + .long 0x3f59ff97, 0x3f4f11d7, 0x3f3d7573, 0x3f24f360 + .long 0x3f0cbfe7, 0x3eec1a69, 0x3eb0a801, 0x3e6753a2 + .long 0x3e132f1a, 0x3db7e7d3, 0x3d320845, 0x3c84d3d4 + .long 0x3bc477b7, 0x3b10d3da, 0x3a01601e, 0x388c1a3b + .long 0x3717b0da, 0x35a43bce, 0x338306c6, 0x00000000 + /*== p3 ==*/ + .align 64 + .long 0xb0343c7b, 0xbd6ee69d, 0xbd8f0da7, 0xbdae477d + .long 0xbdcd2a1f, 0xbdeba80d, 0xbe0c443b, 0xbe293cf3 + .long 0xbe44f282, 0xbe5f3651, 0xbe81c7c0, 0xbe96d7ca + .long 0xbea7fb8e, 0xbeb50e9e, 0xbec12efe, 0xbec4be92 + .long 0xbebce070, 0xbead510e, 0xbe8ef7d6, 0xbe4b8704 + .long 0xbe083237, 0xbdaf7449, 0xbd2e1ec4, 0xbc83bf06 + .long 0xbbc3e0b5, 0xbb10aadc, 0xba0157db, 0xb88c18f2 + .long 0xb717b096, 0xb5a43bae, 0xb383012c, 0x00000000 + /*== p4 ==*/ + .align 64 + .long 0xbeaaaaa5, 0xbeab0612, 0xbea7f01f, 0xbea4e120 + .long 0xbea387b7, 0xbea15962, 0xbe9d57f7, 0xbe976b5a + .long 0xbe90230d, 0xbe880dff, 0xbe7479b3, 0xbe4c3d88 + .long 0xbe212482, 0xbdeb8cba, 0xbd5e78ad, 0x3c6b5e6e + .long 0x3d839143, 0x3dc21ee1, 0x3de347af, 0x3dcbec96 + .long 0x3d99ef2d, 0x3d542ea1, 0x3cdde701, 0x3c2cca67 + .long 0x3b81cb27, 0x3ac073a1, 0x39ac3032, 0x383a94d9 + .long 0x36ca081d, 0x355abd4c, 
0x332b3cb6, 0x00000000 + /*== p5 ==*/ + .align 64 + .long 0xb76dd6b9, 0xbe1c276d, 0x3c1dcf2f, 0x3dc1a78d + .long 0x3d96f985, 0x3da2b61b, 0x3dc13397, 0x3dd2f670 + .long 0x3df48a0a, 0x3e06c5a8, 0x3e1a3aba, 0x3e27c405 + .long 0x3e2e78d0, 0x3e2c3e44, 0x3e1d3097, 0x3df4a8f4 + .long 0x3da38508, 0x3d31416a, 0x3b562657, 0xbcaeeac9 + .long 0xbcce9419, 0xbcaaeac4, 0xbc49e7d0, 0xbba71ddd + .long 0xbb003b0e, 0xba3f9a05, 0xb92c08a7, 0xb7ba9232 + .long 0xb64a0b0f, 0xb4dac169, 0xb2ab78ac, 0x00000000 + /*== p6 ==*/ + .align 64 + .long 0x3e0910e9, 0x43761143, 0x4165ecdc, 0xc190f756 + .long 0xc08c097d, 0xc02ba813, 0xbf7f6bda, 0x3f2b1dc0 + .long 0x3ece105d, 0x3f426a94, 0xbadb0dc4, 0x3da43b17 + .long 0xbd51ab88, 0xbcaea23d, 0xbd3b6d8d, 0xbd6caaad + .long 0xbd795bed, 0xbd5fddda, 0xbd038f3b, 0xbc1cad63 + .long 0x3abb4766, 0x3b95f10b, 0x3b825873, 0x3afaea66 + .long 0x3a49f878, 0x39996bf3, 0x388f3e6c, 0x371bb0e3 + .long 0x35a8a5e6, 0x34369b17, 0x322487b0, 0x00000000 + /*== p7 ==*/ + .align 64 + .long 0xbc0e2f66, 0x460bda12, 0x43d638ef, 0xc3e11c3e + .long 0xc2baa4e9, 0xc249da2d, 0xc1859b82, 0x40dd5b57 + .long 0x40494640, 0x40c730a8, 0xbf0f160e, 0x3e30e76f + .long 0xbea81387, 0xbdb26a1c, 0xbd351e57, 0xbb4c01a0 + .long 0x3c1d7bfb, 0x3c722cd1, 0x3c973f1c, 0x3c33a31b + .long 0x3b862ef4, 0x3a27b3d0, 0xba3b5907, 0xba0efc22 + .long 0xb97f9f0f, 0xb8c8af50, 0xb7bdddfb, 0xb64f2950 + .long 0xb4e085b1, 0xb3731dfa, 0xb15a1f04, 0x00000000 + .align 64 + .long 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000, 0x7fe00000 /* _iExpMantMask_UISA */ + .align 64 + .long 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000, 0x3d400000 /* _iMinIdxOfsMask_UISA */ + .align 64 + .long 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 
0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000, 0x03e00000 /* _iMaxIdxMask_UISA */ + .align 64 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSignMask */ + .align 64 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _sAbsMask */ + .align 64 + .long 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000 /* _iExpMantMask */ + .align 64 + .long 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000 /* _iExpMask */ + .align 64 + .long 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000 /* _iMinIdxOfsMask */ + .align 64 + .long 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000 /* _iMaxIdxMask */ + .align 64 + .type __svml_stanh_data_internal,@object + .size __svml_stanh_data_internal,.-__svml_stanh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core-sse2.S new file mode 100644 index 0000000000..cd290db337 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized tanhf, vector length is 4. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVbN4v_tanhf _ZGVbN4v_tanhf_sse2 +#include "../svml_s_tanhf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core.c new file mode 100644 index 0000000000..2dcb1f3676 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized tanhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVbN4v_tanhf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_tanhf, __GI__ZGVbN4v_tanhf, + __redirect__ZGVbN4v_tanhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core_sse4.S new file mode 100644 index 0000000000..3a0ce20473 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf4_core_sse4.S @@ -0,0 +1,832 @@ +/* Function tanhf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * NOTE: Since the hyperbolic tangent function is odd + * (tanh(x) = -tanh(-x)), below algorithm deals with the absolute + * value of the argument |x|: tanh(x) = sign(x) * tanh(|x|) + * + * We use a table lookup method to compute tanh(|x|). + * The basic idea is to split the input range into a number of subintervals + * and to approximate tanh(.) with a polynomial on each of them. 
+ * + * IEEE SPECIAL CONDITIONS: + * x = [+,-]0, r = [+,-]0 + * x = +Inf, r = +1 + * x = -Inf, r = -1 + * x = QNaN, r = QNaN + * x = SNaN, r = QNaN + * + * + * ALGORITHM DETAILS + * We handle special values in a callout function, aside from main path + * computations. "Special" for this algorithm are: + * INF, NAN, |x| > HUGE_THRESHOLD + * + * + * Main path computations are organized as follows: + * Actually we split the interval [0, SATURATION_THRESHOLD) + * into a number of subintervals. On each subinterval we approximate tanh(.) + * with a minimax polynomial of pre-defined degree. Polynomial coefficients + * are computed beforehand and stored in table. We also use + * + * y := |x| + B, + * + * here B depends on subinterval and is used to make argument + * closer to zero. + * We also add large fake interval [SATURATION_THRESHOLD, HUGE_THRESHOLD], + * where 1.0 + 0.0*y + 0.0*y^2 ... coefficients are stored - just to + * preserve main path computation logic but return 1.0 for all arguments. + * + * Hence reconstruction looks as follows: + * we extract proper polynomial and range reduction coefficients + * (Pj and B), corresponding to subinterval, to which |x| belongs, + * and return + * + * r := sign(x) * (P0 + P1 * y + ... + Pn * y^n) + * + * NOTE: we use multiprecision technique to multiply and sum the first + * K terms of the polynomial. So Pj, j = 0..K are stored in + * table each as a pair of target precision numbers (Pj and PLj) to + * achieve wider than target precision. + * + * + */ + +/* Offsets for data table __svml_stanh_data_internal + */ +#define _dbP 0 +#define _sSignMask 4288 +#define _sAbsMask 4304 +#define _iExpMantMask 4320 +#define _iExpMask 4336 +#define _iMinIdxOfsMask 4352 +#define _iMaxIdxMask 4368 + +#include <sysdep.h> + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_tanhf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm5 + +/* Here huge arguments, INF and NaNs are filtered out to callout.
*/ + movdqu _iExpMantMask+__svml_stanh_data_internal(%rip), %xmm9 + lea _dbP+16+__svml_stanh_data_internal(%rip), %r8 + pand %xmm5, %xmm9 + +/* if VMIN, VMAX is defined for I type */ + pxor %xmm7, %xmm7 + movdqa %xmm9, %xmm6 + psubd _iMinIdxOfsMask+__svml_stanh_data_internal(%rip), %xmm9 + +/* + * small table specific variables * + * Constant loading + */ + movdqu _iMaxIdxMask+__svml_stanh_data_internal(%rip), %xmm10 + movdqa %xmm9, %xmm11 + movdqa %xmm9, %xmm8 + pcmpgtd %xmm10, %xmm11 + pcmpgtd %xmm7, %xmm8 + movdqa %xmm11, %xmm14 + pand %xmm8, %xmm9 + andps %xmm11, %xmm10 + andnps %xmm9, %xmm14 + orps %xmm10, %xmm14 + psrld $14, %xmm14 + movd %xmm14, %edx + pshufd $1, %xmm14, %xmm12 + pshufd $2, %xmm14, %xmm13 + movd %xmm12, %ecx + pshufd $3, %xmm14, %xmm15 + movups _sAbsMask+__svml_stanh_data_internal(%rip), %xmm3 + movslq %edx, %rdx + andps %xmm5, %xmm3 + movslq %ecx, %rcx + pcmpgtd _iExpMask+__svml_stanh_data_internal(%rip), %xmm6 + movd %xmm13, %esi + movups -16(%rdx,%r8), %xmm2 + movaps %xmm2, %xmm0 + movd %xmm15, %edi + movmskps %xmm6, %eax + movups -16(%rcx,%r8), %xmm6 + unpcklpd %xmm6, %xmm0 + unpckhpd %xmm6, %xmm2 + cvtps2pd %xmm3, %xmm6 + movhlps %xmm3, %xmm3 + cvtps2pd %xmm3, %xmm3 + movslq %esi, %rsi + movslq %edi, %rdi + movups (%rcx,%r8), %xmm8 + movups (%rdx,%r8), %xmm12 + movups (%rsi,%r8), %xmm13 + movaps %xmm12, %xmm10 + movups (%rdi,%r8), %xmm9 + movaps %xmm13, %xmm11 + unpckhpd %xmm8, %xmm12 + unpckhpd %xmm9, %xmm13 + mulpd %xmm6, %xmm12 + mulpd %xmm3, %xmm13 + unpcklpd %xmm8, %xmm10 + unpcklpd %xmm9, %xmm11 + addpd %xmm10, %xmm12 + addpd %xmm11, %xmm13 + mulpd %xmm6, %xmm12 + mulpd %xmm3, %xmm13 + addpd %xmm2, %xmm12 + movups -16(%rsi,%r8), %xmm1 + movups -16(%rdi,%r8), %xmm7 + movaps %xmm1, %xmm14 + unpckhpd %xmm7, %xmm1 + addpd %xmm1, %xmm13 + mulpd %xmm12, %xmm6 + mulpd %xmm13, %xmm3 + addpd %xmm0, %xmm6 + unpcklpd %xmm7, %xmm14 + addpd %xmm14, %xmm3 + cvtpd2ps %xmm6, %xmm0 + cvtpd2ps %xmm3, %xmm1 + movups 
_sSignMask+__svml_stanh_data_internal(%rip), %xmm4 + movlhps %xmm1, %xmm0 + andps %xmm5, %xmm4 + orps %xmm4, %xmm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 eax xmm0 xmm5 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm5, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 eax + + xorl %edx, %edx + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %edx, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %eax, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call tanhf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_tanhf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_stanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(16)) VUINT32 _dbP[(134*4)][2]; + __declspec(align(16)) VUINT32
_sSignMask[4][1]; + __declspec(align(16)) VUINT32 _sAbsMask[4][1]; + __declspec(align(16)) VUINT32 _iExpMantMask[4][1]; + __declspec(align(16)) VUINT32 _iExpMask[4][1]; + __declspec(align(16)) VUINT32 _iMinIdxOfsMask[4][1]; + __declspec(align(16)) VUINT32 _iMaxIdxMask[4][1]; +} __svml_stanh_data_internal; +#endif +__svml_stanh_data_internal: + /* Pol_000: err=7.93e-09, x in [0.0000000; 0.0312500]. */ + .quad 0x0000000000000000 /* A00 = +0.000000000000000000000e-01 */ + .quad 0x3FF00000022C70EB /* A01 = +1.000000008097283510367e+00 */ + .quad 0xBED00E878CFFA194 /* A02 = -3.828228912518614443549e-06 */ + .quad 0xBFD551766D0607A9 /* A03 = -3.330970825846813476723e-01 */ + .quad 0xBE53D60CE3E4C297 /* A00 = -1.847383956330407336230e-08 */ + .quad 0x3FF000024177CF5C /* A01 = +1.000002151235967140508e+00 */ + .quad 0xBF1758BC94A51A25 /* A02 = -8.906031613262943753568e-05 */ + .quad 0xBFD53EAE67E0D4F0 /* A03 = -3.319507612644221339337e-01 */ + .quad 0xBE5A9E47EF32D6FE /* A00 = -2.479020984039698285657e-08 */ + .quad 0x3FF00002DA983057 /* A01 = +1.000002721676556793895e+00 */ + .quad 0xBF1BD953509E94AA /* A02 = -1.062352277175377670507e-04 */ + .quad 0xBFD53BDB562EEDD5 /* A03 = -3.317783681520414806876e-01 */ + .quad 0xBE6191BBE496D294 /* A00 = -3.272532162914017685901e-08 */ + .quad 0x3FF0000390492017 /* A01 = +1.000003398528866105366e+00 */ + .quad 0xBF20727E814A57CE /* A02 = -1.254825043772153972919e-04 */ + .quad 0xBFD538DE060A6F22 /* A03 = -3.315959033004550748913e-01 */ + .quad 0xBE66DAFA2A893A25 /* A00 = -4.257146219278012568149e-08 */ + .quad 0x3FF0000465E08CD1 /* A01 = +1.000004194219219266770e+00 */ + .quad 0xBF2341C765EF91B6 /* A02 = -1.469188600530365522261e-04 */ + .quad 0xBFD535B6841FAF9E /* A03 = -3.314033785124993469751e-01 */ + .quad 0xBE6D5794E361E964 /* A00 = -5.465394929765249413434e-08 */ + .quad 0x3FF000055EE2A0CB /* A01 = +1.000005121846742950353e+00 */ + .quad 0xBF265E6C77E66C8B /* A02 = -1.706607253709506650304e-04 */ + .quad 0xBFD53264DDCCEDA6 /* 
A03 = -3.312008062382240103361e-01 */ + .quad 0xBE729C844D374A6E /* A00 = -6.933284462462096107184e-08 */ + .quad 0x3FF000067F019093 /* A01 = +1.000006195180536350264e+00 */ + .quad 0xBF29CC5348D6DCE5 /* A02 = -1.968242326435338705130e-04 */ + .quad 0xBFD52EE92121ED35 /* A03 = -3.309881995734998416658e-01 */ + .quad 0xBE775AEA17EAA872 /* A00 = -8.700465590574974405858e-08 */ + .quad 0x3FF00007CA1D66B8 /* A01 = +1.000007428656699559610e+00 */ + .quad 0xBF2D8F5EB98A2637 /* A02 = -2.255252009216044881395e-04 */ + .quad 0xBFD52B435CDF9128 /* A03 = -3.307655722585587376727e-01 */ + .quad 0xBE7D04DA28C343F0 /* A00 = -1.081040272327705484794e-07 */ + .quad 0x3FF000094443CCF5 /* A01 = +1.000008837375216730337e+00 */ + .quad 0xBF30D5B76C947AE5 /* A02 = -2.568791210978817814332e-04 */ + .quad 0xBFD52773A0776FAD /* A03 = -3.305329386764651045105e-01 */ + .quad 0xBE81DD77A12C51C7 /* A00 = -1.331054169875768625701e-07 */ + .quad 0x3FF0000AF1AFD2DA /* A01 = +1.000010437096696680470e+00 */ + .quad 0xBF331230624C1680 /* A02 = -2.910011410651516805537e-04 */ + .quad 0xBFD52379FC0B61DF /* A03 = -3.302903138515186909352e-01 */ + .quad 0xBE85D04EEEB3C435 /* A00 = -1.625247628488202841012e-07 */ + .quad 0x3FF0000CD6C9B1F2 /* A01 = +1.000012244238970726684e+00 */ + .quad 0xBF357F0742FADDD4 /* A02 = -3.280060509313874068243e-04 */ + .quad 0xBFD51F56806D0E81 /* A03 = -3.300377134475880880338e-01 */ + .quad 0xBE8A6E289B59681B /* A00 = -1.969211333326924655065e-07 */ + .quad 0x3FF0000EF8268F72 /* A01 = +1.000014275873550406715e+00 */ + .quad 0xBF381E277A1B747A /* A02 = -3.680082682942575423093e-04 */ + .quad 0xBFD51B093F1D6FD4 /* A03 = -3.297751537663746734808e-01 */ + .quad 0xBE8FCBC40EE9ABD5 /* A00 = -2.368983653301529373887e-07 */ + .quad 0x3FF000115A883B6C /* A01 = +1.000016549721943981410e+00 */ + .quad 0xBF3AF17AC974B3D9 /* A02 = -4.111218235774406434303e-04 */ + .quad 0xBFD516924A4C549C /* A03 = -3.295026517456081105450e-01 */ + .quad 0xBE92FFBC60A3F956 /* A00 = 
-2.831066871072026054144e-07 */ + .quad 0x3FF0001402DCED8A /* A01 = +1.000019084151832604590e+00 */ + .quad 0xBF3DFAE9390C4801 /* A02 = -4.574603454311488280083e-04 */ + .quad 0xBFD511F1B4D7DC3A /* A03 = -3.292202249571719585575e-01 */ + .quad 0xBE9690A22F96D5AD /* A00 = -3.362443262393081632612e-07 */ + .quad 0x3FF00016F63EFF5D /* A01 = +1.000021898173108825247e+00 */ + .quad 0xBF409E2C839605BB /* A02 = -5.071370461992499986334e-04 */ + .quad 0xBFD50D27924BEE00 /* A03 = -3.289278916051614487515e-01 */ + .quad 0xBE9AA56C65E72A73 /* A00 = -3.970591019557469835586e-07 */ + .quad 0x3FF0001A39F4A43E /* A01 = +1.000025011433776978009e+00 */ + .quad 0xBF425BD74C3D6667 /* A02 = -5.602647074553602319844e-04 */ + .quad 0xBFD50833F6E1ABA2 /* A03 = -3.286256705238718156536e-01 */ + .quad 0xBE9F4BD4FF1A83B0 /* A00 = -4.663500013744687071912e-07 */ + .quad 0x3FF0001DD36F9EC2 /* A01 = +1.000028444215715683896e+00 */ + .quad 0xBF44376634149405 /* A02 = -6.169556656102642569831e-04 */ + .quad 0xBFD50316F77EDEE5 /* A03 = -3.283135811757190158922e-01 */ + .quad 0xBEA3B625387BB079 /* A00 = -5.874486399249461304297e-07 */ + .quad 0x3FF00023E14CFBA9 /* A01 = +1.000034217911642153709e+00 */ + .quad 0xBF47392F923218D2 /* A02 = -7.087213783883111826306e-04 */ + .quad 0xBFD4FB1FACDEB938 /* A03 = -3.278273761924483942209e-01 */ + .quad 0xBEAA6E24F543500A /* A00 = -7.876828740601738750574e-07 */ + .quad 0x3FF0002D5C6E8412 /* A01 = +1.000043259679163742959e+00 */ + .quad 0xBF4BAF02BD7FDD70 /* A02 = -8.448375110664940040861e-04 */ + .quad 0xBFD4EFEE6527A7DE /* A03 = -3.271442401734229177279e-01 */ + .quad 0xBEB16E3EBE2157D0 /* A00 = -1.038947396133402500647e-06 */ + .quad 0x3FF00038990FEE2F /* A01 = +1.000053975962952312884e+00 */ + .quad 0xBF50569481C574CB /* A02 = -9.972048056490652716971e-04 */ + .quad 0xBFD4E419278DA2B4 /* A03 = -3.264220129263251113372e-01 */ + .quad 0xBEB6A7B6723165D4 /* A00 = -1.350350836279403750524e-06 */ + .quad 0x3FF00045CAB4158E /* A01 = 
+1.000066558657042303793e+00 */ + .quad 0xBF531D7C9C849108 /* A02 = -1.166698160951775212202e-03 */ + .quad 0xBFD4D7A0BB33B152 /* A03 = -3.256608799117844954552e-01 */ + .quad 0xBEBD0EE2A8654AFD /* A00 = -1.732000471561702711532e-06 */ + .quad 0x3FF00055276F18D6 /* A01 = +1.000081209219890521211e+00 */ + .quad 0xBF562FDBA3FB6C6C /* A02 = -1.354183666925102939860e-03 */ + .quad 0xBFD4CA85F1B93DB2 /* A03 = -3.248610363561638125773e-01 */ + .quad 0xBEC269D4036A207E /* A00 = -2.195047297096822741730e-06 */ + .quad 0x3FF00066E7DA6E4E /* A01 = +1.000098138500919997540e+00 */ + .quad 0xBF5991499FC36B3A /* A02 = -1.560518167983372759405e-03 */ + .quad 0xBFD4BCC9A72283D6 /* A03 = -3.240226871658341556426e-01 */ + .quad 0xBEC7154B6C09CFE1 /* A00 = -2.751729738565190291276e-06 */ + .quad 0x3FF0007B47086B80 /* A01 = +1.000117566559055148900e+00 */ + .quad 0xBF5D455433B4F8F4 /* A02 = -1.786548832412968197680e-03 */ + .quad 0xBFD4AE6CC1BFE145 /* A03 = -3.231460468373550942722e-01 */ + .quad 0xBECCA68CC64A0F8A /* A00 = -3.415415948561670285790e-06 */ + .quad 0x3FF00092827742F7 /* A01 = +1.000139722473418535387e+00 */ + .quad 0xBF60A7BF15A527AF /* A02 = -2.033112728132522705610e-03 */ + .quad 0xBFD49F703214084C /* A03 = -3.222313393636155876010e-01 */ + .quad 0xBED19E68676B241B /* A00 = -4.200644630977303616698e-06 */ + .quad 0x3FF000ACDA037B26 /* A01 = +1.000164844146362863597e+00 */ + .quad 0xBF62D99F836A02F8 /* A02 = -2.301036405072284102280e-03 */ + .quad 0xBFD48FD4F2B91B28 /* A03 = -3.212787981359945810311e-01 */ + .quad 0xBED57CF4B0C7AA54 /* A00 = -5.123164339408145209103e-06 */ + .quad 0x3FF000CA8FD9E1A1 /* A01 = +1.000193178099017865534e+00 */ + .quad 0xBF653A014548E686 /* A02 = -2.591135484433962181405e-03 */ + .quad 0xBFD47F9C0844B38F /* A03 = -3.202886658426046806447e-01 */ + .quad 0xBEDA012B1B1A41E2 /* A00 = -6.199971197454598722328e-06 */ + .quad 0x3FF000EBE868FDF4 /* A01 = +1.000224979259539459520e+00 */ + .quad 0xBF67CA9427E0A544 /* A02 = 
-2.904214255086275467410e-03 */ + .quad 0xBFD46EC6812ADB37 /* A03 = -3.192611943626845749655e-01 */ + .quad 0xBEDF3EAC5BF12194 /* A00 = -7.449344990702664567927e-06 */ + .quad 0x3FF001112A520784 /* A01 = +1.000260510744255704196e+00 */ + .quad 0xBF6A8D01ABDA4DC4 /* A02 = -3.241065277345108255891e-03 */ + .quad 0xBFD45D55759FFA4A /* A03 = -3.181966446572103146551e-01 */ + .quad 0xBEE2A541BC274267 /* A00 = -8.890883582164319970972e-06 */ + .quad 0x3FF0013A9E5961F2 /* A01 = +1.000300043631906721231e+00 */ + .quad 0xBF6D82ECD080C540 /* A02 = -3.602468994380686462264e-03 */ + .quad 0xBFD44B4A0779C0AD /* A03 = -3.170952866557950611259e-01 */ + .quad 0xBEE61D97609A27F4 /* A00 = -1.054553560499505625520e-05 */ + .quad 0x3FF001688F56A3AF /* A01 = +1.000343856731187974773e+00 */ + .quad 0xBF7056F8EFB683EC /* A02 = -3.989193351487490407647e-03 */ + .quad 0xBFD438A5620F0F74 /* A03 = -3.159573991399533543500e-01 */ + .quad 0xBEEA145429EDD370 /* A00 = -1.243563138839952927732e-05 */ + .quad 0x3FF0019B4A242A67 /* A01 = +1.000392236341804297339e+00 */ + .quad 0xBF7207D31CA78D9B /* A02 = -4.401993423445739288258e-03 */ + .quad 0xBFD42568BA16E7CD /* A03 = -3.147832696228050619602e-01 */ + .quad 0xBEEE96370D52680F /* A00 = -1.458491207477835326165e-05 */ + .quad 0x3FF001D31D8E4115 /* A01 = +1.000445476009251821736e+00 */ + .quad 0xBF73D4CC11EDC094 /* A02 = -4.841611050196221316400e-03 */ + .quad 0xBFD411954D8664E7 /* A03 = -3.135731942252974469021e-01 */ + .quad 0xBEF338C046215EF8 /* A00 = -1.833122622260562810219e-05 */ + .quad 0x3FF00230C32C2EC1 /* A01 = +1.000534784691737621998e+00 */ + .quad 0xBF76BD019BCC5DAF /* A02 = -5.551344188254799492943e-03 */ + .quad 0xBFD3F2C7156DC21E /* A03 = -3.116929730668135389848e-01 */ + .quad 0xBEF9B15EAE411EAE /* A00 = -2.450261207822986676092e-05 */ + .quad 0x3FF002C2DF057A4D /* A01 = +1.000674124886830940184e+00 */ + .quad 0xBF7B08CCD9AC1E30 /* A02 = -6.600189396301511801646e-03 */ + .quad 0xBFD3C7A7A114FED8 /* A03 = 
-3.090609620157755976777e-01 */ + .quad 0xBF00E36483C373B3 /* A00 = -3.221178528332122595812e-05 */ + .quad 0x3FF0036F419480D7 /* A01 = +1.000838524028997644777e+00 */ + .quad 0xBF7FD255D1777007 /* A02 = -7.768950679260206403087e-03 */ + .quad 0xBFD39A453911D6CE /* A03 = -3.062909180947429588215e-01 */ + .quad 0xBF05DFA04DD12059 /* A00 = -4.172046622180685472624e-05 */ + .quad 0x3FF00438B2A03D8D /* A01 = +1.001030633695197069599e+00 */ + .quad 0xBF828F8DBB4A9D10 /* A02 = -9.062869337255224921890e-03 */ + .quad 0xBFD36AAB704697D9 /* A03 = -3.033856007044711255993e-01 */ + .quad 0xBF0BF3E0C647DEFB /* A00 = -5.331544597092331081714e-05 */ + .quad 0x3FF005221063D36D /* A01 = +1.001253189109060359741e+00 */ + .quad 0xBF857A2CB3C96102 /* A02 = -1.048693584122917590862e-02 */ + .quad 0xBFD338E65BBB4FEC /* A03 = -3.003478904549854444639e-01 */ + .quad 0xBF11A506ED7C9D31 /* A00 = -6.730894835681591541979e-05 */ + .quad 0x3FF0062E4D0EA92A /* A01 = +1.001508999829250345925e+00 */ + .quad 0xBF88AB82C2761AF3 /* A02 = -1.204588085125866091241e-02 */ + .quad 0xBFD305028D6BD206 /* A03 = -2.971807843271395688234e-01 */ + .quad 0xBF1607C0922D9BF1 /* A00 = -8.403885708006799337092e-05 */ + .quad 0x3FF007606C341961 /* A01 = +1.001800940198869449560e+00 */ + .quad 0xBF8C25E6DA487BCF /* A02 = -1.374416688582682892494e-02 */ + .quad 0xBFD2CF0D0EE8F7B5 /* A03 = -2.938873906713255768075e-01 */ + .quad 0xBF1B3A8480A0A16D /* A00 = -1.038688061788578038307e-04 */ + .quad 0x3FF008BB802D02D6 /* A01 = +1.002131939589323561535e+00 */ + .quad 0xBF8FEB8AE99FD100 /* A02 = -1.558598065819483124983e-02 */ + .quad 0xBFD297135BD0911B /* A03 = -2.904709240558688843059e-01 */ + .quad 0xBF20ABB9BDB75C65 /* A00 = -1.271881327357976163798e-04 */ + .quad 0x3FF00A42A76D8CD1 /* A01 = +1.002504972472525901495e+00 */ + .quad 0xBF91FF3D752BB9E6 /* A02 = -1.757522609380570560722e-02 */ + .quad 0xBFD25D235C1F88B4 /* A03 = -2.869346999779154305799e-01 */ + .quad 0xBF243D3254425461 /* A00 = 
-1.544116913733432829448e-04 */ + .quad 0x3FF00BF909D1795E /* A01 = +1.002923048355647051011e+00 */ + .quad 0xBF94304E04D44942 /* A02 = -1.971551804042204897316e-02 */ + .quad 0xBFD2214B5E61CFA6 /* A03 = -2.832821294498394371075e-01 */ + .quad 0xBF286070011B61CE /* A00 = -1.859795307186510085994e-04 */ + .quad 0x3FF00DE1D5E1627E /* A01 = +1.003389201612804537689e+00 */ + .quad 0xBF9689D5F4163F59 /* A02 = -2.201017668045266231780e-02 */ + .quad 0xBFD1E39A11C3B42C /* A03 = -2.795167134743816728104e-01 */ + .quad 0xBF2D250B366A79E8 /* A00 = -2.223564326486314902259e-04 */ + .quad 0x3FF010003E134001 /* A01 = +1.003906481248123094829e+00 */ + .quad 0xBF990C9FF91F6F81 /* A02 = -2.446222265267250853271e-02 */ + .quad 0xBFD1A41E80084CDC /* A03 = -2.756420374218586655246e-01 */ + .quad 0xBF314DB5DDC2A30E /* A00 = -2.640313157465248123865e-04 */ + .quad 0x3FF012577608921B /* A01 = +1.004477940624503018441e+00 */ + .quad 0xBF9BB9626875B0C9 /* A02 = -2.707437288829409385849e-02 */ + .quad 0xBFD162E80768A9D0 /* A03 = -2.716617653228725615122e-01 */ + .quad 0xBF346A6133808864 /* A00 = -3.115165050094957730625e-04 */ + .quad 0x3FF014EAAFCC88A3 /* A01 = +1.005106627192198898157e+00 */ + .quad 0xBF9E90BEF9BF7419 /* A02 = -2.984903716411588595059e-02 */ + .quad 0xBFD12006545F7FAD /* A03 = -2.675796340899932457269e-01 */ + .quad 0xBF37F180DC3848EA /* A00 = -3.653468704395550778821e-04 */ + .quad 0x3FF017BD19147861 /* A01 = +1.005795572250939295955e+00 */ + .quad 0xBFA0C9A14C702E07 /* A02 = -3.278831537326359207851e-02 */ + .quad 0xBFD0DB895B650092 /* A03 = -2.633994476818851682154e-01 */ + .quad 0xBF3BEC6AAC6D7635 /* A00 = -4.260788377246944457107e-04 */ + .quad 0x3FF01AD1D884E719 /* A01 = +1.006547780778822565040e+00 */ + .quad 0xBFA260B2A1B1434A /* A02 = -3.589399551186163439542e-02 */ + .quad 0xBFD09581529E93D6 /* A03 = -2.591250712233067465817e-01 */ + .quad 0xBF4164E26167882B /* A00 = -5.308251737086202562063e-04 */ + .quad 0x3FF01FEF14B62B81 /* A01 = 
+1.007796364693348545316e+00 */ + .quad 0xBFA4EB014538AA42 /* A02 = -4.085544557559163403315e-02 */ + .quad 0xBFD029D36FEAF41F /* A03 = -2.525528519580024222613e-01 */ + .quad 0xBF46F6FFF4E53DC8 /* A00 = -7.008313930700277652464e-04 */ + .quad 0x3FF027CBB51CBBA0 /* A01 = +1.009715754956893363214e+00 */ + .quad 0xBFA89DEC9FEC112E /* A02 = -4.807986690687680864098e-02 */ + .quad 0xBFCF2A99464D0DB4 /* A03 = -2.434875100390009317053e-01 */ + .quad 0xBF4DCC9C4F66A4D9 /* A00 = -9.094012482836712945103e-04 */ + .quad 0x3FF030E7CFCCD583 /* A01 = +1.011939822882909068014e+00 */ + .quad 0xBFACAA3B95814081 /* A02 = -5.598627281199331645611e-02 */ + .quad 0xBFCDF78F156BE7CF /* A03 = -2.341173987004467604844e-01 */ + .quad 0xBF5308ED74E5C7A6 /* A00 = -1.161796466103906435435e-03 */ + .quad 0x3FF03B5986412ECB /* A01 = +1.014489674026594512313e+00 */ + .quad 0xBFB087EBA88DCC3F /* A02 = -6.457398285947223148806e-02 */ + .quad 0xBFCCBB9BD134862F /* A03 = -2.244753619680052991736e-01 */ + .quad 0xBF57FA23C00DF4B5 /* A00 = -1.463446533505758208674e-03 */ + .quad 0x3FF0473558A1BCC0 /* A01 = +1.017384859292903342975e+00 */ + .quad 0xBFB2E702BC6360EF /* A02 = -7.383744334527241048871e-02 */ + .quad 0xBFCB77D546379288 /* A03 = -2.145945160729250122955e-01 */ + .quad 0xBF5DD12971557F71 /* A00 = -1.819887610814388068450e-03 */ + .quad 0x3FF0548DDF5000A8 /* A01 = +1.020643112482540360020e+00 */ + .quad 0xBFB571B63DA186E1 /* A02 = -8.376635555898871710045e-02 */ + .quad 0xBFCA2D5202605148 /* A03 = -2.045080672838912594358e-01 */ + .quad 0xBF6252B1AD5D4F17 /* A00 = -2.236697221556737096709e-03 */ + .quad 0x3FF063738A910BF7 /* A01 = +1.024280110622155737232e+00 */ + .quad 0xBFB8270C8E6B601B /* A02 = -9.434584118878357184013e-02 */ + .quad 0xBFC8DD27D950A07E /* A03 = -1.942491351230763441116e-01 */ + .quad 0xBF66470C91730CFC /* A00 = -2.719425723258004842786e-03 */ + .quad 0x3FF073F468FCF331 /* A01 = +1.028309259519300633556e+00 */ + .quad 0xBFBB05C2952191E4 /* A02 = 
-1.055566419686964629854e-01 */ + .quad 0xBFC7886A770DE2BD /* A03 = -1.838505822486435070662e-01 */ + .quad 0xBF6AD114AC8E98EC /* A00 = -3.273525599485007861467e-03 */ + .quad 0x3FF0861BF53E5226 /* A01 = +1.032741506559554434119e+00 */ + .quad 0xBFBE0C4F9B461507 /* A02 = -1.173753503881763554650e-01 */ + .quad 0xBFC6302A037CDE3A /* A03 = -1.733448521642786954722e-01 */ + .quad 0xBF6FFBDE2A6C2AF8 /* A00 = -3.904279630096648551207e-03 */ + .quad 0x3FF099F2EB8E7DA3 /* A01 = +1.037585182326304034106e+00 */ + .quad 0xBFC09C74D192DDF0 /* A02 = -1.297746680554463516444e-01 */ + .quad 0xBFC4D571D8E3079F /* A03 = -1.627638157861470424859e-01 */ + .quad 0xBF72E8FDC0B952AA /* A00 = -4.616728994353872309042e-03 */ + .quad 0x3FF0AF7F273C9533 /* A01 = +1.042845872181101141152e+00 */ + .quad 0xBFC244C512736F10 /* A02 = -1.427236881344176033792e-01 */ + .quad 0xBFC379474F58B902 /* A03 = -1.521386277613104298645e-01 */ + .quad 0xBF762EABAF17395B /* A00 = -5.415602341101023557701e-03 */ + .quad 0x3FF0C6C3886F63FB /* A01 = +1.048526318502125631582e+00 */ + .quad 0xBFC3FDF9918EA12A /* A02 = -1.561881981590514389957e-01 */ + .quad 0xBFC21CA89ECAB895 /* A03 = -1.414995932913753196036e-01 */ + .quad 0xBF79D387CE5B2BAE /* A00 = -6.305246822828998107258e-03 */ + .quad 0x3FF0DFBFE2346376 /* A01 = +1.054626353847394337748e+00 */ + .quad 0xBFC5C6DA43602620 /* A02 = -1.701309994680721970894e-01 */ + .quad 0xBFC0C08BD8DB6631 /* A03 = -1.308760460731704100557e-01 */ + .quad 0xBF7DDBA8E8DA9060 /* A00 = -7.289562037531366334164e-03 */ + .quad 0x3FF0FA70F0D1B464 /* A01 = +1.061142864894713433443e+00 */ + .quad 0xBFC79E18D92BAA7C /* A02 = -1.845122394946264732241e-01 */ + .quad 0xBFBECBBBF74C2669 /* A03 = -1.202962378266875381749e-01 */ + .quad 0xBF81254E76EA25DA /* A00 = -8.371937755572145950511e-03 */ + .quad 0x3FF116D05835EBD0 /* A01 = +1.068069786618014660462e+00 */ + .quad 0xBFC982539E2ED224 /* A02 = -1.992897531869327609755e-01 */ + .quad 0xBFBC1B043C350159 /* A03 = 
-1.097872397413132278254e-01 */ + .quad 0xBF8391ACBA863403 /* A00 = -9.555196230190082448686e-03 */ + .quad 0x3FF134D4AA477FE2 /* A01 = +1.075398125794884141015e+00 */ + .quad 0xBFCB7218609FEAFB /* A02 = -2.144194099235717521079e-01 */ + .quad 0xBFB970A16CB88329 /* A03 = -9.937485603633135211599e-02 */ + .quad 0xBF87935088E48E8B /* A00 = -1.151144902957603431692e-02 */ + .quad 0x3FF1649892AD7DD3 /* A01 = +1.087059567413110938716e+00 */ + .quad 0xBFCE6971DDE75409 /* A02 = -2.375929196847723912089e-01 */ + .quad 0xBFB58291E88CB251 /* A03 = -8.402358939628952472223e-02 */ + .quad 0xBF8DB3A62C325325 /* A00 = -1.450280973794233242702e-02 */ + .quad 0x3FF1A9C900C6DEEA /* A01 = +1.103951457056548068891e+00 */ + .quad 0xBFD13DBC65B0E08E /* A02 = -2.693930619311765140012e-01 */ + .quad 0xBFB06696F62696D1 /* A03 = -6.406539449252625362252e-02 */ + .quad 0xBF92583699F2E27A /* A00 = -1.791463198307716858659e-02 */ + .quad 0x3FF1F451B85AA9F0 /* A01 = +1.122148246892376022288e+00 */ + .quad 0xBFD34FD5F8288180 /* A02 = -3.017477916164565954205e-01 */ + .quad 0xBFA6FB692825B683 /* A03 = -4.488686194495718900788e-02 */ + .quad 0xBF9641C26E673D6F /* A00 = -2.173522757385398448959e-02 */ + .quad 0x3FF24364DA5E2B07 /* A01 = +1.141453602790251542487e+00 */ + .quad 0xBFD564A5A5EF5890 /* A02 = -3.342680092295120530821e-01 */ + .quad 0xBF9B43712011A982 /* A03 = -2.662445791467283467968e-02 */ + .quad 0xBF9A901038EC2F39 /* A00 = -2.594018313816024226548e-02 */ + .quad 0x3FF2961356DFFEBA /* A01 = +1.161639537196534011088e+00 */ + .quad 0xBFD775EBB17198C7 /* A02 = -3.665723069046972759644e-01 */ + .quad 0xBF833B1A926CD462 /* A03 = -9.390075295963199591975e-03 */ + .quad 0xBF9F396A6A461B91 /* A00 = -3.049246095317987084727e-02 */ + .quad 0x3FF2EB53BAEF534B /* A01 = +1.182452898229899629357e+00 */ + .quad 0xBFD97DABF8AD8BBD /* A02 = -3.982953957076310058660e-01 */ + .quad 0x3F7B8F6A3E0F8837 /* A03 = +6.728568086119371925713e-03 */ + .quad 0xBFA21878590F8BAA /* A00 = 
-3.534294211546946951064e-02 */ + .quad 0x3FF34209790236E1 /* A01 = +1.203622315111197105253e+00 */ + .quad 0xBFDB764C0E71BECB /* A02 = -4.290952817018306997277e-01 */ + .quad 0x3F962FE0C03F84C0 /* A03 = +2.166701482190513949888e-02 */ + .quad 0xBFA4B36B9AD27ECC /* A00 = -4.043136849327097492868e-02 */ + .quad 0x3FF3990C5B12FC16 /* A01 = +1.224865298994477935679e+00 */ + .quad 0xBFDD5AABB0D01390 /* A02 = -4.586590983092770912322e-01 */ + .quad 0x3FA21DAF5CA162DB /* A03 = +3.538272863142363083844e-02 */ + .quad 0xBFA7645E4D7BF28B /* A00 = -4.568762489177399105378e-02 */ + .quad 0x3FF3EF2FD51C0D9F /* A01 = +1.245895225962932562069e+00 */ + .quad 0xBFDF26377E1B686E /* A02 = -4.867075664057044503963e-01 */ + .quad 0x3FA8803E756EE812 /* A03 = +4.785342391501513914509e-02 */ + .quad 0xBFAA210925C64413 /* A00 = -5.103329263796054643398e-02 */ + .quad 0x3FF44349F897D8E7 /* A01 = +1.266427966181760345066e+00 */ + .quad 0xBFE06A7B02C6D8E2 /* A02 = -5.129981092675530707226e-01 */ + .quad 0x3FAE3F194734F5D0 /* A03 = +5.907515520309980505687e-02 */ + .quad 0xBFACDE48F8A19BBB /* A00 = -5.638340029764018351832e-02 */ + .quad 0x3FF49439D5466582 /* A01 = +1.286187966447272845727e+00 */ + .quad 0xBFE131C7C1063DDC /* A02 = -5.373266954429101183166e-01 */ + .quad 0x3FB1ADEEC36AD805 /* A03 = +6.906025191241844940482e-02 */ + .quad 0xBFAF905D8F585680 /* A00 = -6.164829611604449866036e-02 */ + .quad 0x3FF4E0ED1FD27F99 /* A01 = +1.304913639360142818546e+00 */ + .quad 0xBFE1E7A859DC1D3D /* A02 = -5.595285182070380836095e-01 */ + .quad 0x3FB3ED018E4642A1 /* A03 = +7.783517573831001679086e-02 */ + .quad 0xBFB11595104160BA /* A00 = -6.673556944713512906198e-02 */ + .quad 0x3FF528650340490B /* A01 = +1.322361958217302513319e+00 */ + .quad 0xBFE28B14B40BC974 /* A02 = -5.794776455425521000109e-01 */ + .quad 0x3FB5DF49F5BAF6D7 /* A03 = +8.543836831355676453281e-02 */ + .quad 0xBFB2513A97344BA4 /* A00 = -7.155195418844911836587e-02 */ + .quad 0x3FF569BA0DB5EE14 /* A01 = 
+1.338312200124055273420e+00 */ + .quad 0xBFE31B53A8B67B20 /* A02 = -5.970857901737396389308e-01 */ + .quad 0x3FB787F297BB0544 /* A03 = +9.191814617499455275507e-02 */ + .quad 0xBFB37512E848FAFA /* A00 = -7.600515528700305112331e-02 */ + .quad 0x3FF5A41F33B403C8 /* A01 = +1.352568819013173495591e+00 */ + .quad 0xBFE397F6EA9A58A5 /* A02 = -6.123003561103997904880e-01 */ + .quad 0x3FB8EAA9FF25CA06 /* A03 = +9.733068923177520814782e-02 */ + .quad 0xBFB47B3E603AFC5D /* A00 = -8.000554894805263217439e-02 */ + .quad 0x3FF5D6E3EDE40487 /* A01 = +1.364963464031718975988e+00 */ + .quad 0xBFE400D5BCA6D631 /* A02 = -6.251019177058819709103e-01 */ + .quad 0x3FBA0B830ED567FE /* A03 = +1.017381583418739132707e-01 */ + .quad 0xBFB5BBFE8AC90496 /* A00 = -8.489981544791400103200e-02 */ + .quad 0x3FF612BA70107E95 /* A01 = +1.379572332145390989311e+00 */ + .quad 0xBFE477EAF1FA7693 /* A02 = -6.396383978023599814478e-01 */ + .quad 0x3FBB4784B7C08A95 /* A03 = +1.065600346196709652391e-01 */ + .quad 0xBFB6D5D940743939 /* A00 = -8.920057128509463473254e-02 */ + .quad 0x3FF644A8748F70CE /* A01 = +1.391762214006166953340e+00 */ + .quad 0xBFE4D646AB07EA37 /* A02 = -6.511567440459832267763e-01 */ + .quad 0x3FBC354F4E1D5292 /* A03 = +1.101884427747086558913e-01 */ + .quad 0xBFB7223D19E4F3D1 /* A00 = -9.036619074045339206069e-02 */ + .quad 0x3FF6518FEB42B7FA /* A01 = +1.394912642466350494175e+00 */ + .quad 0xBFE4ED86CB87498C /* A02 = -6.539949393430091184598e-01 */ + .quad 0x3FBC6D29F28CCA9B /* A03 = +1.110407082713131127205e-01 */ + .quad 0xBFB6878652FF6312 /* A00 = -8.800544287022329936754e-02 */ + .quad 0x3FF63948C302D040 /* A01 = +1.388985406648330922508e+00 */ + .quad 0xBFE4C4E2E7904E17 /* A02 = -6.490339777687407218920e-01 */ + .quad 0x3FBC127356CA1ABE /* A03 = +1.096565329445224612481e-01 */ + .quad 0xBFB4F5D18B0C91D6 /* A00 = -8.187589306596207427980e-02 */ + .quad 0x3FF5FD27EB7DD0B8 /* A01 = +1.374305648697413673176e+00 */ + .quad 0xBFE464E01A2B2FC6 /* A02 = 
-6.373138915164353601739e-01 */ + .quad 0x3FBB460547674A30 /* A03 = +1.065371798825160976065e-01 */ + .quad 0xBFB26642FA16A685 /* A00 = -7.187288861919156890412e-02 */ + .quad 0x3FF59F9BEDE1C95A /* A01 = +1.351467065073470141812e+00 */ + .quad 0xBFE3D67920C8FBEA /* A02 = -6.199308052381387046381e-01 */ + .quad 0x3FBA24F6A8D3CBC1 /* A03 = +1.021265184570401413078e-01 */ + .quad 0xBFADB5294794F097 /* A00 = -5.802277563859197656582e-02 */ + .quad 0x3FF523EA7B9CF453 /* A01 = +1.321268542159732772845e+00 */ + .quad 0xBFE322A8B55E35DB /* A02 = -5.979808370918208160205e-01 */ + .quad 0x3FB8C8673B1B3E37 /* A03 = +9.680791085269722928697e-02 */ + .quad 0xBFA4B7D661965C6A /* A00 = -4.046506825687219699450e-02 */ + .quad 0x3FF48DE3E2CE3122 /* A01 = +1.284641157110919085227e+00 */ + .quad 0xBFE251FED1A7F445 /* A02 = -5.725092024655472622285e-01 */ + .quad 0x3FB745699FCABDB9 /* A03 = +9.090290213747821701507e-02 */ + .quad 0xBF93E60456E4EE1D /* A00 = -1.943213253365004902773e-02 */ + .quad 0x3FF3E1A14E628A59 /* A01 = +1.242585474196536532432e+00 */ + .quad 0xBFE16C5AB660E876 /* A02 = -5.444768488007543094653e-01 */ + .quad 0x3FB5AD33AA8C188F /* A03 = +8.467410005332197397987e-02 */ + .quad 0x3F738C17C47C7961 /* A00 = +4.772274820224659853951e-03 */ + .quad 0x3FF3234DDE3BD146 /* A01 = +1.196119182682268355933e+00 */ + .quad 0xBFE078C0D77A9D3B /* A02 = -5.147403915952176722826e-01 */ + .quad 0x3FB40D74B3E276B8 /* A03 = +7.833032027925923568290e-02 */ + .quad 0x3FA0474BECC689C7 /* A00 = +3.179394975019849550746e-02 */ + .quad 0x3FF256FB4FA7D18A /* A01 = +1.146235762743432307076e+00 */ + .quad 0xBFDEFA8E3FB285E2 /* A02 = -4.840427038235174395098e-01 */ + .quad 0x3FB270C007493D59 /* A03 = +7.203293016322244446403e-02 */ + .quad 0x3FAF5BD51E479BDC /* A00 = +6.124750132203590768931e-02 */ + .quad 0x3FF18081D0B53BC5 /* A01 = +1.093873801484492647162e+00 */ + .quad 0xBFDCFE2439BD0C03 /* A02 = -4.530115665294831006626e-01 */ + .quad 0x3FB0DEFE5A45AFDD /* A03 = 
+6.590261176978580437424e-02 */ + .quad 0x3FB7BD5D2806EA26 /* A00 = +9.273321368429118805032e-02 */ + .quad 0x3FF0A369E35B4440 /* A01 = +1.039895904647224256223e+00 */ + .quad 0xBFDB04BC5C9951E7 /* A02 = -4.221640495573226181669e-01 */ + .quad 0x3FAEBBBAA9D6DEEF /* A03 = +6.002600978120919278380e-02 */ + .quad 0x3FC01BE411098DBC /* A00 = +1.258511622610124502941e-01 */ + .quad 0x3FEF85BDABC031C1 /* A01 = +9.850757936961188621083e-01 */ + .quad 0xBFD91521375097C2 /* A02 = -3.919146576102968682065e-01 */ + .quad 0x3FABE26F0086D982 /* A03 = +5.446192628317005068883e-02 */ + .quad 0x3FC481D7FF5776B9 /* A00 = +1.602125164781023347604e-01 */ + .quad 0x3FEDC3506C1E7218 /* A01 = +9.300920592973538347792e-01 */ + .quad 0xBFD7349A88DA7D4F /* A02 = -3.625856720409119104964e-01 */ + .quad 0x3FA936E2DFF8E2AE /* A03 = +4.924687370334389358018e-02 */ + .quad 0x3FC90471F96FA27A /* A00 = +1.954481571149420671141e-01 */ + .quad 0x3FEC0451601987A2 /* A01 = +8.755270840595026360376e-01 */ + .quad 0xBFD5671CD4B898DC /* A02 = -3.344184949259110251063e-01 */ + .quad 0x3FA6BB9594603B67 /* A03 = +4.439990459660841243261e-02 */ + .quad 0x3FCFD8ADB9ED944C /* A00 = +2.488000066615846384011e-01 */ + .quad 0x3FE978C073F6809A /* A01 = +7.959902062321078108909e-01 */ + .quad 0xBFD2DF7E00BCD5A9 /* A02 = -2.948908812716931060471e-01 */ + .quad 0x3FA3614033D490B2 /* A03 = +3.785133965200894456959e-02 */ + .quad 0x3FD4846A12AFE5A0 /* A00 = +3.205819303981005674586e-01 */ + .quad 0x3FE63A1147D40472 /* A01 = +6.945883181471244061100e-01 */ + .quad 0xBFCFA2268AD34450 /* A02 = -2.471359422548027318101e-01 */ + .quad 0x3F9F150201D9FFE0 /* A03 = +3.035357605267552383310e-02 */ + .quad 0x3FD9018641F82BEB /* A00 = +3.907180446846598154131e-01 */ + .quad 0x3FE33B7C220FFBDC /* A01 = +6.010113396913498995389e-01 */ + .quad 0xBFCA4E4187E29C86 /* A02 = -2.055131829740483584423e-01 */ + .quad 0x3F98C30CED19F8F4 /* A03 = +2.418155858185229434287e-02 */ + .quad 0x3FDD4B8255BEB078 /* A00 = 
+4.577337109901757905561e-01 */ + .quad 0x3FE0858B19D3A49B /* A01 = +5.163016800335243905451e-01 */ + .quad 0xBFC5BC929EACE564 /* A02 = -1.698172831327539045176e-01 */ + .quad 0x3F93A083CE57DE2B /* A03 = +1.916700312537337677621e-02 */ + .quad 0x3FE0A8E5E039295C /* A00 = +5.206174258576470315063e-01 */ + .quad 0x3FDC35E1234583FE /* A01 = +4.407885403107342225937e-01 */ + .quad 0xBFC1DE034E31AEB9 /* A02 = -1.395877963835710222629e-01 */ + .quad 0x3F8EFDEBB3471BDC /* A03 = +1.513275280821162888101e-02 */ + .quad 0x3FE2851B603CB2A5 /* A00 = +5.787484054213406503564e-01 */ + .quad 0x3FD7F4A44ABBB286 /* A01 = +3.743067483726821853551e-01 */ + .quad 0xBFBD3EEB67087DE7 /* A02 = -1.142413260026767657385e-01 */ + .quad 0x3F8864F38329E8BD /* A03 = +1.191129917173260922836e-02 */ + .quad 0x3FE437DBE3C34AC1 /* A00 = +6.318187187665317283702e-01 */ + .quad 0x3FD43F6F789441B5 /* A01 = +3.163717916040938438194e-01 */ + .quad 0xBFB7D92E7901B9A4 /* A02 = -9.315767721429907277653e-02 */ + .quad 0x3F8327ED342308E1 /* A03 = +9.353497651663324544136e-03 */ + .quad 0x3FE5C0977766D55C /* A00 = +6.797597248138731451661e-01 */ + .quad 0x3FD10B42A764D8F9 /* A01 = +2.663122782427219115142e-01 */ + .quad 0xBFB3633351D3D70F /* A02 = -7.573242900602060456716e-02 */ + .quad 0x3F7E079E30FF899C /* A03 = +7.331483779099558922843e-03 */ + .quad 0x3FE7202CE08A88C4 /* A00 = +7.226776490754436288455e-01 */ + .quad 0x3FCC973EB5662B01 /* A01 = +2.233656297433626314319e-01 */ + .quad 0xBFAF70A455F9920B /* A02 = -6.140626477716545211782e-02 */ + .quad 0x3F77812411CE99B6 /* A03 = +5.738392731393584730859e-03 */ + .quad 0x3FE85879424095B1 /* A00 = +7.608000082006382003286e-01 */ + .quad 0x3FC7E73BD1674D84 /* A01 = +1.867441914060742336190e-01 */ + .quad 0xBFA96F84E4BF333B /* A02 = -4.967894832916504993525e-02 */ + .quad 0x3F72606DDCA6E117 /* A03 = +4.486493251924870105662e-03 */ + .quad 0x3FE96BFE4957F4DD /* A00 = +7.944327766887472330737e-01 */ + .quad 0x3FC3ED4780D25478 /* A01 = 
+1.556786898624158421711e-01 */ + .quad 0xBFA489C5F9A56B58 /* A02 = -4.011362717093075458408e-02 */ + .quad 0x3F6CB5DC17E9AD2A /* A03 = +3.504686231556104931972e-03 */ + .quad 0x3FEA5D9CB2F41234 /* A00 = +8.239272589858672724006e-01 */ + .quad 0x3FC091A758374DCF /* A01 = +1.294449978582705440555e-01 */ + .quad 0xBFA08E436D4B5CE0 /* A02 = -3.233538350257858517978e-02 */ + .quad 0x3F666997AD53E6B7 /* A03 = +2.735897297154145629133e-03 */ + .quad 0x3FEB3060342CB850 /* A00 = +8.496552485501158713532e-01 */ + .quad 0x3FBB7D30BBC7DC1B /* A01 = +1.073790033768634993860e-01 */ + .quad 0xBF9AA6BA3443D9E3 /* A02 = -2.602663940430173170060e-02 */ + .quad 0x3F617CA764B7850B /* A03 = +2.134634914668814050648e-03 */ + .quad 0x3FEBE759A6A0C7B8 /* A00 = +8.719909910635044170135e-01 */ + .quad 0x3FB6C10DE6A703FF /* A01 = +8.888327485239243264115e-02 */ + .quad 0xBF956C566D8BE1F6 /* A02 = -2.092108768099084498138e-02 */ + .quad 0x3F5B46D1A4A59CF8 /* A03 = +1.664833764687232917079e-03 */ + .quad 0x3FEC858494887A04 /* A00 = +8.912985707318630268503e-01 */ + .quad 0x3FB2CC31F543394D /* A01 = +7.342827070099140762682e-02 */ + .quad 0xBF9133477FF69137 /* A02 = -1.679717749142747504343e-02 */ + .quad 0x3F5544482FBB4DA5 /* A03 = +1.298017973501022466823e-03 */ + .quad 0x3FED0DB59D0E32E9 /* A00 = +9.079235141267335551518e-01 */ + .quad 0x3FAF006BAFFC6EF4 /* A01 = +6.055008433597022787787e-02 */ + .quad 0xBF8B97146FA2B97A /* A02 = -1.347175565419144252499e-02 */ + .quad 0x3F5093B01F4CDC69 /* A03 = +1.011774057770665211434e-03 */ + .quad 0x3FEDB487C3EC457C /* A00 = +9.282873942012623835751e-01 */ + .quad 0x3FA7390C09D0BD1D /* A01 = +4.535710925881118044112e-02 */ + .quad 0xBF83D9F7C3181106 /* A02 = -9.693084374710735778846e-03 */ + .quad 0x3F46E34A0A3C0E64 /* A03 = +6.984817050299072134500e-04 */ + .quad 0x3FEE5FFCB4E6EB00 /* A00 = +9.492171796076434020506e-01 */ + .quad 0x3F9F4913ED00AADF /* A01 = +3.055220731782070861526e-02 */ + .quad 0xBF79670BD0E59B5C /* A02 = 
-6.201788097633133961528e-03 */ + .quad 0x3F3BC998EBCAF96D /* A03 = +4.240034429975534616304e-04 */ + .quad 0x3FEEDBA41E9542FE /* A00 = +9.643116566968215064293e-01 */ + .quad 0x3F94F5DD18D9C24D /* A01 = +2.046914543319848858727e-02 */ + .quad 0xBF7034896AA122B9 /* A02 = -3.956352980886528904192e-03 */ + .quad 0x3F30DCCB47810B39 /* A03 = +2.573009765038273091199e-04 */ + .quad 0x3FEF33F2882520ED /* A00 = +9.750912341196716903724e-01 */ + .quad 0x3F8BF37F2CF553FF /* A01 = +1.364802699996836392315e-02 */ + .quad 0xBF649F6F05A69619 /* A02 = -2.517430152880317534986e-03 */ + .quad 0x3F247623C950AAC9 /* A03 = +1.561087307505231250044e-04 */ + .quad 0x3FEF727757751741 /* A00 = +9.827229221489021115943e-01 */ + .quad 0x3F828E67912C4400 /* A01 = +9.060677640748693306705e-03 */ + .quad 0xBF5A2F51A806CC2C /* A02 = -1.598195784123355826789e-03 */ + .quad 0x3F18D35D7687E613 /* A03 = +9.470231965016282719549e-05 */ + .quad 0x3FEF9E6325C5942A /* A00 = +9.880843866091073568469e-01 */ + .quad 0x3F788AB117618F76 /* A01 = +5.991641772286606867914e-03 */ + .quad 0xBF5096EAB0B1EA89 /* A02 = -1.012543859160305046233e-03 */ + .quad 0x3F0E1E50EC4435AB /* A03 = +5.744633156910412119652e-05 */ + .quad 0x3FEFBD0784049369 /* A00 = +9.918248728250605994461e-01 */ + .quad 0x3F702BBD8294035F /* A01 = +3.947963975634432264028e-03 */ + .quad 0xBF44FB55E0F00593 /* A02 = -6.403130845457509273330e-04 */ + .quad 0x3F0244DCD723230A /* A03 = +3.484534217219031730379e-05 */ + .quad 0x3FEFD245E2366A43 /* A00 = +9.944180887426415926811e-01 */ + .quad 0x3F653D82EC088433 /* A01 = +2.592807490387838333795e-03 */ + .quad 0xBF3A7DF75E013CB8 /* A02 = -4.042366908878036561859e-04 */ + .quad 0x3EF6298E69F991CD /* A03 = +2.113564425911141559972e-05 */ + .quad 0x3FEFE0EAA508BC69 /* A00 = +9.962056372950317539861e-01 */ + .quad 0x3F5BD0771AF3FDDA /* A01 = +1.697651208644282514598e-03 */ + .quad 0xBF30B2E1254DE571 /* A02 = -2.548026725928887099328e-04 */ + .quad 0x3EEAE28B70EC0256 /* A03 = 
+1.281973848454955042307e-05 */ + .quad 0x3FEFEAF5303D7F96 /* A00 = +9.974313680831865536192e-01 */ + .quad 0x3F5229111365657E /* A01 = +1.108423877289460134782e-03 */ + .quad 0xBF250572D04DFE66 /* A02 = -1.603796628408704519168e-04 */ + .quad 0x3EE04E89BB57C981 /* A03 = +7.775682983689149966743e-06 */ + .quad 0x3FEFF1CF52F1CF44 /* A00 = +9.982678051005469122003e-01 */ + .quad 0x3F47A71316147CEB /* A01 = +7.218211359577819110842e-04 */ + .quad 0xBF1A6D7604055719 /* A02 = -1.008132248946049582547e-04 */ + .quad 0x3ED3C8047586A85C /* A03 = +4.716233739913014633626e-06 */ + .quad 0x3FEFF6770369EF69 /* A00 = +9.988360468555416149528e-01 */ + .quad 0x3F3EBB261180FBF0 /* A01 = +4.689186039321105101130e-04 */ + .quad 0xBF1097754FE19D7F /* A02 = -6.329206004950480057066e-05 */ + .quad 0x3EC7FEFF83BCA0A7 /* A03 = +2.860556404988488738366e-06 */ + .quad 0x3FEFF99D42371AC4 /* A00 = +9.992204945818561334647e-01 */ + .quad 0x3F33EB2AEC271F59 /* A01 = +3.039340773764907474054e-04 */ + .quad 0xBF04CF18E0FC0D79 /* A02 = -3.968996690952969588805e-05 */ + .quad 0x3EBD1BDBD6019BE9 /* A03 = +1.735021065507727833886e-06 */ + .quad 0x3FEFFBBCA32B0D91 /* A00 = +9.994795977476532700123e-01 */ + .quad 0x3F29C41E1615110A /* A01 = +1.965796209707565346710e-04 */ + .quad 0xBEFA11F93D9DCB5A /* A02 = -2.486248909101414873235e-05 */ + .quad 0x3EB1A7CA4546F7A7 /* A03 = +1.052345642723709228769e-06 */ + .quad 0x3FEFFD298B8E8DE2 /* A00 = +9.996535993308806045121e-01 */ + .quad 0x3F20A1C42D523C5B /* A01 = +1.268913244172078754520e-04 */ + .quad 0xBEF0507A364AFAE4 /* A02 = -1.555859070622834605755e-05 */ + .quad 0x3EA56ACA17E7CDF4 /* A03 = +6.382806956848098872313e-07 */ + .quad 0x3FEFFE1DC82BA5A3 /* A00 = +9.997700604991915929176e-01 */ + .quad 0x3F156E73B90F1769 /* A01 = +8.175450626798714452801e-05 */ + .quad 0xBEE4663579D0A09F /* A02 = -9.727122057226747625365e-06 */ + .quad 0x3E99FAF6FEC5D4C1 /* A03 = +3.871371052824002996020e-07 */ + .quad 0x3FEFFEF8D0BB5E81 /* A00 = 
+9.998745037837154514548e-01 */ + .quad 0x3F06686DA18D39C3 /* A01 = +4.273972098777251447726e-05 */ + .quad 0xBED46BC298073E90 /* A02 = -4.868731025855742842491e-06 */ + .quad 0x3E88E42286B9D0FD /* A03 = +1.854535328530838170114e-07 */ + .quad 0x3FEFFF8DBC68DDC7 /* A00 = +9.999455146670975791423e-01 */ + .quad 0x3EF26B2953A80AF0 /* A01 = +1.756534514108903368909e-05 */ + .quad 0xBEBFC4472D580F83 /* A02 = -1.893443529411295465239e-06 */ + .quad 0x3E72505B4553D19F /* A03 = +6.822456673547912277047e-08 */ + .quad 0x3FEFFFCED1276609 /* A00 = +9.999765477215883935358e-01 */ + .quad 0x3EDE1A94C7CC58F5 /* A01 = +7.177313020153979672606e-06 */ + .quad 0xBEA8A2C988744E57 /* A02 = -7.342066660497443762363e-07 */ + .quad 0x3E5AF30036BBBAF4 /* A03 = +2.509841882843541084885e-08 */ + .quad 0x3FEFFFEAFE70FCFC /* A00 = +9.999899835164849370983e-01 */ + .quad 0x3EC879175E3549F5 /* A01 = +2.917410471128503564412e-06 */ + .quad 0xBE930E36677D1813 /* A02 = -2.839493400307523115929e-07 */ + .quad 0x3E43D4005B42D48F /* A03 = +9.233192745401904898013e-09 */ + .quad 0x3ff0000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .align 16 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSignMask */ + .align 16 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _sAbsMask */ + .align 16 + .long 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000 /* _iExpMantMask */ + .align 16 + .long 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000 /* _iExpMask */ + .align 16 + .long 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000 /* _iMinIdxOfsMask */ + .align 16 + .long 0x04280000, 0x04280000, 0x04280000, 0x04280000 /* _iMaxIdxMask */ + .align 16 + .type __svml_stanh_data_internal,@object + .size __svml_stanh_data_internal,.-__svml_stanh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core-sse.S new file mode 100644 index 0000000000..a56795e3cd --- /dev/null +++ 
b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core-sse.S
@@ -0,0 +1,20 @@
+/* SSE version of vectorized tanhf, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define _ZGVdN8v_tanhf _ZGVdN8v_tanhf_sse_wrapper
+#include "../svml_s_tanhf8_core.S"
diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core.c
new file mode 100644
index 0000000000..fadcea36ab
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core.c
@@ -0,0 +1,28 @@
+/* Multiple versions of vectorized tanhf, vector length is 8.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVdN8v_tanhf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_tanhf, __GI__ZGVdN8v_tanhf, + __redirect__ZGVdN8v_tanhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core_avx2.S new file mode 100644 index 0000000000..c19e6bf8b5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_tanhf8_core_avx2.S @@ -0,0 +1,844 @@ +/* Function tanhf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * NOTE: Since the hyperbolic tangent function is odd + * (tanh(x) = -tanh(-x)), below algorithm deals with the absolute + * value of the argument |x|: tanh(x) = sign(x) * tanh(|x|) + * + * We use a table lookup method to compute tanh(|x|). + * The basic idea is to split the input range into a number of subintervals + * and to approximate tanh(.) with a polynomial on each of them.
+ * + * IEEE SPECIAL CONDITIONS: + * x = [+,-]0, r = [+,-]0 + * x = +Inf, r = +1 + * x = -Inf, r = -1 + * x = QNaN, r = QNaN + * x = SNaN, r = QNaN + * + * + * ALGORITHM DETAILS + * We handle special values in a callout function, aside from main path + * computations. "Special" for this algorithm are: + * INF, NAN, |x| > HUGE_THRESHOLD + * + * + * Main path computations are organized as follows: + * Actually we split the interval [0, SATURATION_THRESHOLD) + * into a number of subintervals. On each subinterval we approximate tanh(.) + * with a minimax polynomial of pre-defined degree. Polynomial coefficients + * are computed beforehand and stored in table. We also use + * + * y := |x| + B, + * + * here B depends on subinterval and is used to make argument + * closer to zero. + * We also add large fake interval [SATURATION_THRESHOLD, HUGE_THRESHOLD], + * where 1.0 + 0.0*y + 0.0*y^2 ... coefficients are stored - just to + * preserve main path computation logic but return 1.0 for all arguments. + * + * Hence reconstruction looks as follows: + * we extract proper polynomial and range reduction coefficients + * (Pj and B), corresponding to subinterval, to which |x| belongs, + * and return + * + * r := sign(x) * (P0 + P1 * y + ... + Pn * y^n) + * + * NOTE: we use multiprecision technique to multiply and sum the first + * K terms of the polynomial. So Pj, j = 0..K are stored in + * table each as a pair of target precision numbers (Pj and PLj) to + * achieve wider than target precision. 
+ * + * + */ + +/* Offsets for data table __svml_stanh_data_internal + */ +#define _dbP 0 +#define _sSignMask 4288 +#define _sAbsMask 4320 +#define _iExpMantMask 4352 +#define _iExpMask 4384 +#define _iMinIdxOfsMask 4416 +#define _iMaxIdxMask 4448 + +#include <sysdep.h> + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_tanhf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + pushq %r12 + subq $120, %rsp + lea _dbP+16+__svml_stanh_data_internal(%rip), %r10 + vmovaps %ymm0, %ymm12 + +/* Here huge arguments, INF and NaNs are filtered out to callout. */ + vpand _iExpMantMask+__svml_stanh_data_internal(%rip), %ymm12, %ymm14 + +/* + * small table specific variables * + * Constant loading + */ + vmovups _iMaxIdxMask+__svml_stanh_data_internal(%rip), %ymm8 + vpsubd _iMinIdxOfsMask+__svml_stanh_data_internal(%rip), %ymm14, %ymm9 + +/* if VMIN, VMAX is defined for I type */ + vxorps %ymm15, %ymm15, %ymm15 + vpcmpgtd %ymm15, %ymm9, %ymm0 + vpand %ymm0, %ymm9, %ymm7 + vpcmpgtd %ymm8, %ymm9, %ymm6 + vblendvps %ymm6, %ymm8, %ymm7, %ymm3 + vpsrld $14, %ymm3, %ymm1 + vpcmpgtd _iExpMask+__svml_stanh_data_internal(%rip), %ymm14, %ymm13 + vmovmskps %ymm13, %r11d + vandps _sAbsMask+__svml_stanh_data_internal(%rip), %ymm12, %ymm10 + vandps _sSignMask+__svml_stanh_data_internal(%rip), %ymm12, %ymm11 + vextractf128 $1, %ymm1, %xmm2 + vmovd %xmm1, %r9d + vmovd %xmm2, %ecx + vpextrd $1, %xmm2, %edx + vpextrd $1, %xmm1, %r8d + movslq %r9d, %r9 + movslq %edx, %rdx + movslq %r8d, %r8 + vpextrd $2, %xmm1, %edi + movslq %ecx, %rcx + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -8; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xf8, 0xff, 0xff, 0xff, 0x22 + vpextrd $3, %xmm2, %r12d + vpextrd $3, %xmm1, %esi + vpextrd $2, %xmm2, %eax + movslq %edi, %rdi + movslq %r12d, %r12 + movslq %esi, %rsi + movslq %eax,
%rax + vmovupd -16(%r9,%r10), %xmm5 + vmovupd -16(%rdx,%r10), %xmm14 + vmovupd -16(%rcx,%r10), %xmm13 + vmovupd (%r9,%r10), %xmm1 + vmovupd (%r8,%r10), %xmm2 + vmovupd -16(%r8,%r10), %xmm4 + vinsertf128 $1, -16(%rdi,%r10), %ymm5, %ymm15 + vinsertf128 $1, -16(%r12,%r10), %ymm14, %ymm3 + vinsertf128 $1, -16(%rax,%r10), %ymm13, %ymm6 + vinsertf128 $1, (%rdi,%r10), %ymm1, %ymm5 + vinsertf128 $1, (%rsi,%r10), %ymm2, %ymm14 + vunpcklpd %ymm3, %ymm6, %ymm8 + vunpckhpd %ymm3, %ymm6, %ymm6 + vunpcklpd %ymm14, %ymm5, %ymm3 + vunpckhpd %ymm14, %ymm5, %ymm2 + vmovupd (%rcx,%r10), %xmm13 + vcvtps2pd %xmm10, %ymm5 + vextractf128 $1, %ymm10, %xmm10 + vfmadd213pd %ymm3, %ymm5, %ymm2 + vinsertf128 $1, -16(%rsi,%r10), %ymm4, %ymm0 + vmovupd (%rdx,%r10), %xmm4 + vunpcklpd %ymm0, %ymm15, %ymm9 + vunpckhpd %ymm0, %ymm15, %ymm7 + vfmadd213pd %ymm7, %ymm5, %ymm2 + vfmadd213pd %ymm9, %ymm5, %ymm2 + vinsertf128 $1, (%r12,%r10), %ymm4, %ymm0 + vcvtps2pd %xmm10, %ymm4 + vinsertf128 $1, (%rax,%r10), %ymm13, %ymm15 + vunpcklpd %ymm0, %ymm15, %ymm1 + vunpckhpd %ymm0, %ymm15, %ymm0 + vfmadd213pd %ymm1, %ymm4, %ymm0 + vcvtpd2ps %ymm2, %xmm1 + vfmadd213pd %ymm6, %ymm4, %ymm0 + vfmadd213pd %ymm8, %ymm4, %ymm0 + vcvtpd2ps %ymm0, %xmm0 + vinsertf128 $1, %xmm0, %ymm1, %ymm2 + vorps %ymm11, %ymm2, %ymm0 + testl %r11d, %r11d + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r13 r14 r15 r11d ymm0 ymm12 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $120, %rsp + cfi_restore(12) + popq %r12 + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -8; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xf8, 0xff, 0xff, 0xff, 0x22 + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm12, 
32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r13 r14 r15 r11d ymm0 + + xorl %r12d, %r12d + # LOE rbx r13 r14 r15 r11d r12d + + vzeroupper + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + movl %r11d, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -120; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x88, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -128; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call tanhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp
L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_tanhf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_stanh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct +{ + __declspec(align(32)) VUINT32 _dbP[(134*4)][2]; + __declspec(align(32)) VUINT32 _sSignMask[8][1]; + __declspec(align(32)) VUINT32 _sAbsMask[8][1]; + __declspec(align(32)) VUINT32 _iExpMantMask[8][1]; + __declspec(align(32)) VUINT32 _iExpMask[8][1]; + __declspec(align(32)) VUINT32 _iMinIdxOfsMask[8][1]; + __declspec(align(32)) VUINT32 _iMaxIdxMask[8][1]; +} __svml_stanh_data_internal; +#endif +__svml_stanh_data_internal: + /* Pol_000: err=7.93e-09, x in [0.0000000; 0.0312500]. */ + .quad 0x0000000000000000 /* A00 = +0.000000000000000000000e-01 */ + .quad 0x3FF00000022C70EB /* A01 = +1.000000008097283510367e+00 */ + .quad 0xBED00E878CFFA194 /* A02 = -3.828228912518614443549e-06 */ + .quad 0xBFD551766D0607A9 /* A03 = -3.330970825846813476723e-01 */ + .quad 0xBE53D60CE3E4C297 /* A00 = -1.847383956330407336230e-08 */ + .quad 0x3FF000024177CF5C /* A01 = +1.000002151235967140508e+00 */ + .quad 0xBF1758BC94A51A25 /* A02 = -8.906031613262943753568e-05 */ + .quad 0xBFD53EAE67E0D4F0 /* A03 = -3.319507612644221339337e-01 */ + .quad 0xBE5A9E47EF32D6FE /* A00 = -2.479020984039698285657e-08 */ + .quad 0x3FF00002DA983057 /* A01 = +1.000002721676556793895e+00 */ + .quad 0xBF1BD953509E94AA /* A02 = -1.062352277175377670507e-04 */ + .quad 0xBFD53BDB562EEDD5 /* A03 = -3.317783681520414806876e-01 */ + .quad 0xBE6191BBE496D294 /* A00 = -3.272532162914017685901e-08 */ + .quad 0x3FF0000390492017 /* A01 = +1.000003398528866105366e+00 */ + .quad 0xBF20727E814A57CE /* A02 = -1.254825043772153972919e-04 */ + .quad 0xBFD538DE060A6F22 /* A03 = -3.315959033004550748913e-01 */ + .quad 0xBE66DAFA2A893A25 /* A00 = -4.257146219278012568149e-08 */ + .quad 0x3FF0000465E08CD1 /* A01 = +1.000004194219219266770e+00 */ + .quad 0xBF2341C765EF91B6 /* A02 = -1.469188600530365522261e-04 */ + .quad 
0xBFD535B6841FAF9E /* A03 = -3.314033785124993469751e-01 */ + .quad 0xBE6D5794E361E964 /* A00 = -5.465394929765249413434e-08 */ + .quad 0x3FF000055EE2A0CB /* A01 = +1.000005121846742950353e+00 */ + .quad 0xBF265E6C77E66C8B /* A02 = -1.706607253709506650304e-04 */ + .quad 0xBFD53264DDCCEDA6 /* A03 = -3.312008062382240103361e-01 */ + .quad 0xBE729C844D374A6E /* A00 = -6.933284462462096107184e-08 */ + .quad 0x3FF000067F019093 /* A01 = +1.000006195180536350264e+00 */ + .quad 0xBF29CC5348D6DCE5 /* A02 = -1.968242326435338705130e-04 */ + .quad 0xBFD52EE92121ED35 /* A03 = -3.309881995734998416658e-01 */ + .quad 0xBE775AEA17EAA872 /* A00 = -8.700465590574974405858e-08 */ + .quad 0x3FF00007CA1D66B8 /* A01 = +1.000007428656699559610e+00 */ + .quad 0xBF2D8F5EB98A2637 /* A02 = -2.255252009216044881395e-04 */ + .quad 0xBFD52B435CDF9128 /* A03 = -3.307655722585587376727e-01 */ + .quad 0xBE7D04DA28C343F0 /* A00 = -1.081040272327705484794e-07 */ + .quad 0x3FF000094443CCF5 /* A01 = +1.000008837375216730337e+00 */ + .quad 0xBF30D5B76C947AE5 /* A02 = -2.568791210978817814332e-04 */ + .quad 0xBFD52773A0776FAD /* A03 = -3.305329386764651045105e-01 */ + .quad 0xBE81DD77A12C51C7 /* A00 = -1.331054169875768625701e-07 */ + .quad 0x3FF0000AF1AFD2DA /* A01 = +1.000010437096696680470e+00 */ + .quad 0xBF331230624C1680 /* A02 = -2.910011410651516805537e-04 */ + .quad 0xBFD52379FC0B61DF /* A03 = -3.302903138515186909352e-01 */ + .quad 0xBE85D04EEEB3C435 /* A00 = -1.625247628488202841012e-07 */ + .quad 0x3FF0000CD6C9B1F2 /* A01 = +1.000012244238970726684e+00 */ + .quad 0xBF357F0742FADDD4 /* A02 = -3.280060509313874068243e-04 */ + .quad 0xBFD51F56806D0E81 /* A03 = -3.300377134475880880338e-01 */ + .quad 0xBE8A6E289B59681B /* A00 = -1.969211333326924655065e-07 */ + .quad 0x3FF0000EF8268F72 /* A01 = +1.000014275873550406715e+00 */ + .quad 0xBF381E277A1B747A /* A02 = -3.680082682942575423093e-04 */ + .quad 0xBFD51B093F1D6FD4 /* A03 = -3.297751537663746734808e-01 */ + .quad 0xBE8FCBC40EE9ABD5 /* A00 = 
-2.368983653301529373887e-07 */ + .quad 0x3FF000115A883B6C /* A01 = +1.000016549721943981410e+00 */ + .quad 0xBF3AF17AC974B3D9 /* A02 = -4.111218235774406434303e-04 */ + .quad 0xBFD516924A4C549C /* A03 = -3.295026517456081105450e-01 */ + .quad 0xBE92FFBC60A3F956 /* A00 = -2.831066871072026054144e-07 */ + .quad 0x3FF0001402DCED8A /* A01 = +1.000019084151832604590e+00 */ + .quad 0xBF3DFAE9390C4801 /* A02 = -4.574603454311488280083e-04 */ + .quad 0xBFD511F1B4D7DC3A /* A03 = -3.292202249571719585575e-01 */ + .quad 0xBE9690A22F96D5AD /* A00 = -3.362443262393081632612e-07 */ + .quad 0x3FF00016F63EFF5D /* A01 = +1.000021898173108825247e+00 */ + .quad 0xBF409E2C839605BB /* A02 = -5.071370461992499986334e-04 */ + .quad 0xBFD50D27924BEE00 /* A03 = -3.289278916051614487515e-01 */ + .quad 0xBE9AA56C65E72A73 /* A00 = -3.970591019557469835586e-07 */ + .quad 0x3FF0001A39F4A43E /* A01 = +1.000025011433776978009e+00 */ + .quad 0xBF425BD74C3D6667 /* A02 = -5.602647074553602319844e-04 */ + .quad 0xBFD50833F6E1ABA2 /* A03 = -3.286256705238718156536e-01 */ + .quad 0xBE9F4BD4FF1A83B0 /* A00 = -4.663500013744687071912e-07 */ + .quad 0x3FF0001DD36F9EC2 /* A01 = +1.000028444215715683896e+00 */ + .quad 0xBF44376634149405 /* A02 = -6.169556656102642569831e-04 */ + .quad 0xBFD50316F77EDEE5 /* A03 = -3.283135811757190158922e-01 */ + .quad 0xBEA3B625387BB079 /* A00 = -5.874486399249461304297e-07 */ + .quad 0x3FF00023E14CFBA9 /* A01 = +1.000034217911642153709e+00 */ + .quad 0xBF47392F923218D2 /* A02 = -7.087213783883111826306e-04 */ + .quad 0xBFD4FB1FACDEB938 /* A03 = -3.278273761924483942209e-01 */ + .quad 0xBEAA6E24F543500A /* A00 = -7.876828740601738750574e-07 */ + .quad 0x3FF0002D5C6E8412 /* A01 = +1.000043259679163742959e+00 */ + .quad 0xBF4BAF02BD7FDD70 /* A02 = -8.448375110664940040861e-04 */ + .quad 0xBFD4EFEE6527A7DE /* A03 = -3.271442401734229177279e-01 */ + .quad 0xBEB16E3EBE2157D0 /* A00 = -1.038947396133402500647e-06 */ + .quad 0x3FF00038990FEE2F /* A01 = 
+1.000053975962952312884e+00 */ + .quad 0xBF50569481C574CB /* A02 = -9.972048056490652716971e-04 */ + .quad 0xBFD4E419278DA2B4 /* A03 = -3.264220129263251113372e-01 */ + .quad 0xBEB6A7B6723165D4 /* A00 = -1.350350836279403750524e-06 */ + .quad 0x3FF00045CAB4158E /* A01 = +1.000066558657042303793e+00 */ + .quad 0xBF531D7C9C849108 /* A02 = -1.166698160951775212202e-03 */ + .quad 0xBFD4D7A0BB33B152 /* A03 = -3.256608799117844954552e-01 */ + .quad 0xBEBD0EE2A8654AFD /* A00 = -1.732000471561702711532e-06 */ + .quad 0x3FF00055276F18D6 /* A01 = +1.000081209219890521211e+00 */ + .quad 0xBF562FDBA3FB6C6C /* A02 = -1.354183666925102939860e-03 */ + .quad 0xBFD4CA85F1B93DB2 /* A03 = -3.248610363561638125773e-01 */ + .quad 0xBEC269D4036A207E /* A00 = -2.195047297096822741730e-06 */ + .quad 0x3FF00066E7DA6E4E /* A01 = +1.000098138500919997540e+00 */ + .quad 0xBF5991499FC36B3A /* A02 = -1.560518167983372759405e-03 */ + .quad 0xBFD4BCC9A72283D6 /* A03 = -3.240226871658341556426e-01 */ + .quad 0xBEC7154B6C09CFE1 /* A00 = -2.751729738565190291276e-06 */ + .quad 0x3FF0007B47086B80 /* A01 = +1.000117566559055148900e+00 */ + .quad 0xBF5D455433B4F8F4 /* A02 = -1.786548832412968197680e-03 */ + .quad 0xBFD4AE6CC1BFE145 /* A03 = -3.231460468373550942722e-01 */ + .quad 0xBECCA68CC64A0F8A /* A00 = -3.415415948561670285790e-06 */ + .quad 0x3FF00092827742F7 /* A01 = +1.000139722473418535387e+00 */ + .quad 0xBF60A7BF15A527AF /* A02 = -2.033112728132522705610e-03 */ + .quad 0xBFD49F703214084C /* A03 = -3.222313393636155876010e-01 */ + .quad 0xBED19E68676B241B /* A00 = -4.200644630977303616698e-06 */ + .quad 0x3FF000ACDA037B26 /* A01 = +1.000164844146362863597e+00 */ + .quad 0xBF62D99F836A02F8 /* A02 = -2.301036405072284102280e-03 */ + .quad 0xBFD48FD4F2B91B28 /* A03 = -3.212787981359945810311e-01 */ + .quad 0xBED57CF4B0C7AA54 /* A00 = -5.123164339408145209103e-06 */ + .quad 0x3FF000CA8FD9E1A1 /* A01 = +1.000193178099017865534e+00 */ + .quad 0xBF653A014548E686 /* A02 = 
-2.591135484433962181405e-03 */ + .quad 0xBFD47F9C0844B38F /* A03 = -3.202886658426046806447e-01 */ + .quad 0xBEDA012B1B1A41E2 /* A00 = -6.199971197454598722328e-06 */ + .quad 0x3FF000EBE868FDF4 /* A01 = +1.000224979259539459520e+00 */ + .quad 0xBF67CA9427E0A544 /* A02 = -2.904214255086275467410e-03 */ + .quad 0xBFD46EC6812ADB37 /* A03 = -3.192611943626845749655e-01 */ + .quad 0xBEDF3EAC5BF12194 /* A00 = -7.449344990702664567927e-06 */ + .quad 0x3FF001112A520784 /* A01 = +1.000260510744255704196e+00 */ + .quad 0xBF6A8D01ABDA4DC4 /* A02 = -3.241065277345108255891e-03 */ + .quad 0xBFD45D55759FFA4A /* A03 = -3.181966446572103146551e-01 */ + .quad 0xBEE2A541BC274267 /* A00 = -8.890883582164319970972e-06 */ + .quad 0x3FF0013A9E5961F2 /* A01 = +1.000300043631906721231e+00 */ + .quad 0xBF6D82ECD080C540 /* A02 = -3.602468994380686462264e-03 */ + .quad 0xBFD44B4A0779C0AD /* A03 = -3.170952866557950611259e-01 */ + .quad 0xBEE61D97609A27F4 /* A00 = -1.054553560499505625520e-05 */ + .quad 0x3FF001688F56A3AF /* A01 = +1.000343856731187974773e+00 */ + .quad 0xBF7056F8EFB683EC /* A02 = -3.989193351487490407647e-03 */ + .quad 0xBFD438A5620F0F74 /* A03 = -3.159573991399533543500e-01 */ + .quad 0xBEEA145429EDD370 /* A00 = -1.243563138839952927732e-05 */ + .quad 0x3FF0019B4A242A67 /* A01 = +1.000392236341804297339e+00 */ + .quad 0xBF7207D31CA78D9B /* A02 = -4.401993423445739288258e-03 */ + .quad 0xBFD42568BA16E7CD /* A03 = -3.147832696228050619602e-01 */ + .quad 0xBEEE96370D52680F /* A00 = -1.458491207477835326165e-05 */ + .quad 0x3FF001D31D8E4115 /* A01 = +1.000445476009251821736e+00 */ + .quad 0xBF73D4CC11EDC094 /* A02 = -4.841611050196221316400e-03 */ + .quad 0xBFD411954D8664E7 /* A03 = -3.135731942252974469021e-01 */ + .quad 0xBEF338C046215EF8 /* A00 = -1.833122622260562810219e-05 */ + .quad 0x3FF00230C32C2EC1 /* A01 = +1.000534784691737621998e+00 */ + .quad 0xBF76BD019BCC5DAF /* A02 = -5.551344188254799492943e-03 */ + .quad 0xBFD3F2C7156DC21E /* A03 = 
-3.116929730668135389848e-01 */ + .quad 0xBEF9B15EAE411EAE /* A00 = -2.450261207822986676092e-05 */ + .quad 0x3FF002C2DF057A4D /* A01 = +1.000674124886830940184e+00 */ + .quad 0xBF7B08CCD9AC1E30 /* A02 = -6.600189396301511801646e-03 */ + .quad 0xBFD3C7A7A114FED8 /* A03 = -3.090609620157755976777e-01 */ + .quad 0xBF00E36483C373B3 /* A00 = -3.221178528332122595812e-05 */ + .quad 0x3FF0036F419480D7 /* A01 = +1.000838524028997644777e+00 */ + .quad 0xBF7FD255D1777007 /* A02 = -7.768950679260206403087e-03 */ + .quad 0xBFD39A453911D6CE /* A03 = -3.062909180947429588215e-01 */ + .quad 0xBF05DFA04DD12059 /* A00 = -4.172046622180685472624e-05 */ + .quad 0x3FF00438B2A03D8D /* A01 = +1.001030633695197069599e+00 */ + .quad 0xBF828F8DBB4A9D10 /* A02 = -9.062869337255224921890e-03 */ + .quad 0xBFD36AAB704697D9 /* A03 = -3.033856007044711255993e-01 */ + .quad 0xBF0BF3E0C647DEFB /* A00 = -5.331544597092331081714e-05 */ + .quad 0x3FF005221063D36D /* A01 = +1.001253189109060359741e+00 */ + .quad 0xBF857A2CB3C96102 /* A02 = -1.048693584122917590862e-02 */ + .quad 0xBFD338E65BBB4FEC /* A03 = -3.003478904549854444639e-01 */ + .quad 0xBF11A506ED7C9D31 /* A00 = -6.730894835681591541979e-05 */ + .quad 0x3FF0062E4D0EA92A /* A01 = +1.001508999829250345925e+00 */ + .quad 0xBF88AB82C2761AF3 /* A02 = -1.204588085125866091241e-02 */ + .quad 0xBFD305028D6BD206 /* A03 = -2.971807843271395688234e-01 */ + .quad 0xBF1607C0922D9BF1 /* A00 = -8.403885708006799337092e-05 */ + .quad 0x3FF007606C341961 /* A01 = +1.001800940198869449560e+00 */ + .quad 0xBF8C25E6DA487BCF /* A02 = -1.374416688582682892494e-02 */ + .quad 0xBFD2CF0D0EE8F7B5 /* A03 = -2.938873906713255768075e-01 */ + .quad 0xBF1B3A8480A0A16D /* A00 = -1.038688061788578038307e-04 */ + .quad 0x3FF008BB802D02D6 /* A01 = +1.002131939589323561535e+00 */ + .quad 0xBF8FEB8AE99FD100 /* A02 = -1.558598065819483124983e-02 */ + .quad 0xBFD297135BD0911B /* A03 = -2.904709240558688843059e-01 */ + .quad 0xBF20ABB9BDB75C65 /* A00 = 
-1.271881327357976163798e-04 */ + .quad 0x3FF00A42A76D8CD1 /* A01 = +1.002504972472525901495e+00 */ + .quad 0xBF91FF3D752BB9E6 /* A02 = -1.757522609380570560722e-02 */ + .quad 0xBFD25D235C1F88B4 /* A03 = -2.869346999779154305799e-01 */ + .quad 0xBF243D3254425461 /* A00 = -1.544116913733432829448e-04 */ + .quad 0x3FF00BF909D1795E /* A01 = +1.002923048355647051011e+00 */ + .quad 0xBF94304E04D44942 /* A02 = -1.971551804042204897316e-02 */ + .quad 0xBFD2214B5E61CFA6 /* A03 = -2.832821294498394371075e-01 */ + .quad 0xBF286070011B61CE /* A00 = -1.859795307186510085994e-04 */ + .quad 0x3FF00DE1D5E1627E /* A01 = +1.003389201612804537689e+00 */ + .quad 0xBF9689D5F4163F59 /* A02 = -2.201017668045266231780e-02 */ + .quad 0xBFD1E39A11C3B42C /* A03 = -2.795167134743816728104e-01 */ + .quad 0xBF2D250B366A79E8 /* A00 = -2.223564326486314902259e-04 */ + .quad 0x3FF010003E134001 /* A01 = +1.003906481248123094829e+00 */ + .quad 0xBF990C9FF91F6F81 /* A02 = -2.446222265267250853271e-02 */ + .quad 0xBFD1A41E80084CDC /* A03 = -2.756420374218586655246e-01 */ + .quad 0xBF314DB5DDC2A30E /* A00 = -2.640313157465248123865e-04 */ + .quad 0x3FF012577608921B /* A01 = +1.004477940624503018441e+00 */ + .quad 0xBF9BB9626875B0C9 /* A02 = -2.707437288829409385849e-02 */ + .quad 0xBFD162E80768A9D0 /* A03 = -2.716617653228725615122e-01 */ + .quad 0xBF346A6133808864 /* A00 = -3.115165050094957730625e-04 */ + .quad 0x3FF014EAAFCC88A3 /* A01 = +1.005106627192198898157e+00 */ + .quad 0xBF9E90BEF9BF7419 /* A02 = -2.984903716411588595059e-02 */ + .quad 0xBFD12006545F7FAD /* A03 = -2.675796340899932457269e-01 */ + .quad 0xBF37F180DC3848EA /* A00 = -3.653468704395550778821e-04 */ + .quad 0x3FF017BD19147861 /* A01 = +1.005795572250939295955e+00 */ + .quad 0xBFA0C9A14C702E07 /* A02 = -3.278831537326359207851e-02 */ + .quad 0xBFD0DB895B650092 /* A03 = -2.633994476818851682154e-01 */ + .quad 0xBF3BEC6AAC6D7635 /* A00 = -4.260788377246944457107e-04 */ + .quad 0x3FF01AD1D884E719 /* A01 = 
+1.006547780778822565040e+00 */ + .quad 0xBFA260B2A1B1434A /* A02 = -3.589399551186163439542e-02 */ + .quad 0xBFD09581529E93D6 /* A03 = -2.591250712233067465817e-01 */ + .quad 0xBF4164E26167882B /* A00 = -5.308251737086202562063e-04 */ + .quad 0x3FF01FEF14B62B81 /* A01 = +1.007796364693348545316e+00 */ + .quad 0xBFA4EB014538AA42 /* A02 = -4.085544557559163403315e-02 */ + .quad 0xBFD029D36FEAF41F /* A03 = -2.525528519580024222613e-01 */ + .quad 0xBF46F6FFF4E53DC8 /* A00 = -7.008313930700277652464e-04 */ + .quad 0x3FF027CBB51CBBA0 /* A01 = +1.009715754956893363214e+00 */ + .quad 0xBFA89DEC9FEC112E /* A02 = -4.807986690687680864098e-02 */ + .quad 0xBFCF2A99464D0DB4 /* A03 = -2.434875100390009317053e-01 */ + .quad 0xBF4DCC9C4F66A4D9 /* A00 = -9.094012482836712945103e-04 */ + .quad 0x3FF030E7CFCCD583 /* A01 = +1.011939822882909068014e+00 */ + .quad 0xBFACAA3B95814081 /* A02 = -5.598627281199331645611e-02 */ + .quad 0xBFCDF78F156BE7CF /* A03 = -2.341173987004467604844e-01 */ + .quad 0xBF5308ED74E5C7A6 /* A00 = -1.161796466103906435435e-03 */ + .quad 0x3FF03B5986412ECB /* A01 = +1.014489674026594512313e+00 */ + .quad 0xBFB087EBA88DCC3F /* A02 = -6.457398285947223148806e-02 */ + .quad 0xBFCCBB9BD134862F /* A03 = -2.244753619680052991736e-01 */ + .quad 0xBF57FA23C00DF4B5 /* A00 = -1.463446533505758208674e-03 */ + .quad 0x3FF0473558A1BCC0 /* A01 = +1.017384859292903342975e+00 */ + .quad 0xBFB2E702BC6360EF /* A02 = -7.383744334527241048871e-02 */ + .quad 0xBFCB77D546379288 /* A03 = -2.145945160729250122955e-01 */ + .quad 0xBF5DD12971557F71 /* A00 = -1.819887610814388068450e-03 */ + .quad 0x3FF0548DDF5000A8 /* A01 = +1.020643112482540360020e+00 */ + .quad 0xBFB571B63DA186E1 /* A02 = -8.376635555898871710045e-02 */ + .quad 0xBFCA2D5202605148 /* A03 = -2.045080672838912594358e-01 */ + .quad 0xBF6252B1AD5D4F17 /* A00 = -2.236697221556737096709e-03 */ + .quad 0x3FF063738A910BF7 /* A01 = +1.024280110622155737232e+00 */ + .quad 0xBFB8270C8E6B601B /* A02 = 
-9.434584118878357184013e-02 */ + .quad 0xBFC8DD27D950A07E /* A03 = -1.942491351230763441116e-01 */ + .quad 0xBF66470C91730CFC /* A00 = -2.719425723258004842786e-03 */ + .quad 0x3FF073F468FCF331 /* A01 = +1.028309259519300633556e+00 */ + .quad 0xBFBB05C2952191E4 /* A02 = -1.055566419686964629854e-01 */ + .quad 0xBFC7886A770DE2BD /* A03 = -1.838505822486435070662e-01 */ + .quad 0xBF6AD114AC8E98EC /* A00 = -3.273525599485007861467e-03 */ + .quad 0x3FF0861BF53E5226 /* A01 = +1.032741506559554434119e+00 */ + .quad 0xBFBE0C4F9B461507 /* A02 = -1.173753503881763554650e-01 */ + .quad 0xBFC6302A037CDE3A /* A03 = -1.733448521642786954722e-01 */ + .quad 0xBF6FFBDE2A6C2AF8 /* A00 = -3.904279630096648551207e-03 */ + .quad 0x3FF099F2EB8E7DA3 /* A01 = +1.037585182326304034106e+00 */ + .quad 0xBFC09C74D192DDF0 /* A02 = -1.297746680554463516444e-01 */ + .quad 0xBFC4D571D8E3079F /* A03 = -1.627638157861470424859e-01 */ + .quad 0xBF72E8FDC0B952AA /* A00 = -4.616728994353872309042e-03 */ + .quad 0x3FF0AF7F273C9533 /* A01 = +1.042845872181101141152e+00 */ + .quad 0xBFC244C512736F10 /* A02 = -1.427236881344176033792e-01 */ + .quad 0xBFC379474F58B902 /* A03 = -1.521386277613104298645e-01 */ + .quad 0xBF762EABAF17395B /* A00 = -5.415602341101023557701e-03 */ + .quad 0x3FF0C6C3886F63FB /* A01 = +1.048526318502125631582e+00 */ + .quad 0xBFC3FDF9918EA12A /* A02 = -1.561881981590514389957e-01 */ + .quad 0xBFC21CA89ECAB895 /* A03 = -1.414995932913753196036e-01 */ + .quad 0xBF79D387CE5B2BAE /* A00 = -6.305246822828998107258e-03 */ + .quad 0x3FF0DFBFE2346376 /* A01 = +1.054626353847394337748e+00 */ + .quad 0xBFC5C6DA43602620 /* A02 = -1.701309994680721970894e-01 */ + .quad 0xBFC0C08BD8DB6631 /* A03 = -1.308760460731704100557e-01 */ + .quad 0xBF7DDBA8E8DA9060 /* A00 = -7.289562037531366334164e-03 */ + .quad 0x3FF0FA70F0D1B464 /* A01 = +1.061142864894713433443e+00 */ + .quad 0xBFC79E18D92BAA7C /* A02 = -1.845122394946264732241e-01 */ + .quad 0xBFBECBBBF74C2669 /* A03 = 
-1.202962378266875381749e-01 */ + .quad 0xBF81254E76EA25DA /* A00 = -8.371937755572145950511e-03 */ + .quad 0x3FF116D05835EBD0 /* A01 = +1.068069786618014660462e+00 */ + .quad 0xBFC982539E2ED224 /* A02 = -1.992897531869327609755e-01 */ + .quad 0xBFBC1B043C350159 /* A03 = -1.097872397413132278254e-01 */ + .quad 0xBF8391ACBA863403 /* A00 = -9.555196230190082448686e-03 */ + .quad 0x3FF134D4AA477FE2 /* A01 = +1.075398125794884141015e+00 */ + .quad 0xBFCB7218609FEAFB /* A02 = -2.144194099235717521079e-01 */ + .quad 0xBFB970A16CB88329 /* A03 = -9.937485603633135211599e-02 */ + .quad 0xBF87935088E48E8B /* A00 = -1.151144902957603431692e-02 */ + .quad 0x3FF1649892AD7DD3 /* A01 = +1.087059567413110938716e+00 */ + .quad 0xBFCE6971DDE75409 /* A02 = -2.375929196847723912089e-01 */ + .quad 0xBFB58291E88CB251 /* A03 = -8.402358939628952472223e-02 */ + .quad 0xBF8DB3A62C325325 /* A00 = -1.450280973794233242702e-02 */ + .quad 0x3FF1A9C900C6DEEA /* A01 = +1.103951457056548068891e+00 */ + .quad 0xBFD13DBC65B0E08E /* A02 = -2.693930619311765140012e-01 */ + .quad 0xBFB06696F62696D1 /* A03 = -6.406539449252625362252e-02 */ + .quad 0xBF92583699F2E27A /* A00 = -1.791463198307716858659e-02 */ + .quad 0x3FF1F451B85AA9F0 /* A01 = +1.122148246892376022288e+00 */ + .quad 0xBFD34FD5F8288180 /* A02 = -3.017477916164565954205e-01 */ + .quad 0xBFA6FB692825B683 /* A03 = -4.488686194495718900788e-02 */ + .quad 0xBF9641C26E673D6F /* A00 = -2.173522757385398448959e-02 */ + .quad 0x3FF24364DA5E2B07 /* A01 = +1.141453602790251542487e+00 */ + .quad 0xBFD564A5A5EF5890 /* A02 = -3.342680092295120530821e-01 */ + .quad 0xBF9B43712011A982 /* A03 = -2.662445791467283467968e-02 */ + .quad 0xBF9A901038EC2F39 /* A00 = -2.594018313816024226548e-02 */ + .quad 0x3FF2961356DFFEBA /* A01 = +1.161639537196534011088e+00 */ + .quad 0xBFD775EBB17198C7 /* A02 = -3.665723069046972759644e-01 */ + .quad 0xBF833B1A926CD462 /* A03 = -9.390075295963199591975e-03 */ + .quad 0xBF9F396A6A461B91 /* A00 = 
-3.049246095317987084727e-02 */ + .quad 0x3FF2EB53BAEF534B /* A01 = +1.182452898229899629357e+00 */ + .quad 0xBFD97DABF8AD8BBD /* A02 = -3.982953957076310058660e-01 */ + .quad 0x3F7B8F6A3E0F8837 /* A03 = +6.728568086119371925713e-03 */ + .quad 0xBFA21878590F8BAA /* A00 = -3.534294211546946951064e-02 */ + .quad 0x3FF34209790236E1 /* A01 = +1.203622315111197105253e+00 */ + .quad 0xBFDB764C0E71BECB /* A02 = -4.290952817018306997277e-01 */ + .quad 0x3F962FE0C03F84C0 /* A03 = +2.166701482190513949888e-02 */ + .quad 0xBFA4B36B9AD27ECC /* A00 = -4.043136849327097492868e-02 */ + .quad 0x3FF3990C5B12FC16 /* A01 = +1.224865298994477935679e+00 */ + .quad 0xBFDD5AABB0D01390 /* A02 = -4.586590983092770912322e-01 */ + .quad 0x3FA21DAF5CA162DB /* A03 = +3.538272863142363083844e-02 */ + .quad 0xBFA7645E4D7BF28B /* A00 = -4.568762489177399105378e-02 */ + .quad 0x3FF3EF2FD51C0D9F /* A01 = +1.245895225962932562069e+00 */ + .quad 0xBFDF26377E1B686E /* A02 = -4.867075664057044503963e-01 */ + .quad 0x3FA8803E756EE812 /* A03 = +4.785342391501513914509e-02 */ + .quad 0xBFAA210925C64413 /* A00 = -5.103329263796054643398e-02 */ + .quad 0x3FF44349F897D8E7 /* A01 = +1.266427966181760345066e+00 */ + .quad 0xBFE06A7B02C6D8E2 /* A02 = -5.129981092675530707226e-01 */ + .quad 0x3FAE3F194734F5D0 /* A03 = +5.907515520309980505687e-02 */ + .quad 0xBFACDE48F8A19BBB /* A00 = -5.638340029764018351832e-02 */ + .quad 0x3FF49439D5466582 /* A01 = +1.286187966447272845727e+00 */ + .quad 0xBFE131C7C1063DDC /* A02 = -5.373266954429101183166e-01 */ + .quad 0x3FB1ADEEC36AD805 /* A03 = +6.906025191241844940482e-02 */ + .quad 0xBFAF905D8F585680 /* A00 = -6.164829611604449866036e-02 */ + .quad 0x3FF4E0ED1FD27F99 /* A01 = +1.304913639360142818546e+00 */ + .quad 0xBFE1E7A859DC1D3D /* A02 = -5.595285182070380836095e-01 */ + .quad 0x3FB3ED018E4642A1 /* A03 = +7.783517573831001679086e-02 */ + .quad 0xBFB11595104160BA /* A00 = -6.673556944713512906198e-02 */ + .quad 0x3FF528650340490B /* A01 = 
+1.322361958217302513319e+00 */ + .quad 0xBFE28B14B40BC974 /* A02 = -5.794776455425521000109e-01 */ + .quad 0x3FB5DF49F5BAF6D7 /* A03 = +8.543836831355676453281e-02 */ + .quad 0xBFB2513A97344BA4 /* A00 = -7.155195418844911836587e-02 */ + .quad 0x3FF569BA0DB5EE14 /* A01 = +1.338312200124055273420e+00 */ + .quad 0xBFE31B53A8B67B20 /* A02 = -5.970857901737396389308e-01 */ + .quad 0x3FB787F297BB0544 /* A03 = +9.191814617499455275507e-02 */ + .quad 0xBFB37512E848FAFA /* A00 = -7.600515528700305112331e-02 */ + .quad 0x3FF5A41F33B403C8 /* A01 = +1.352568819013173495591e+00 */ + .quad 0xBFE397F6EA9A58A5 /* A02 = -6.123003561103997904880e-01 */ + .quad 0x3FB8EAA9FF25CA06 /* A03 = +9.733068923177520814782e-02 */ + .quad 0xBFB47B3E603AFC5D /* A00 = -8.000554894805263217439e-02 */ + .quad 0x3FF5D6E3EDE40487 /* A01 = +1.364963464031718975988e+00 */ + .quad 0xBFE400D5BCA6D631 /* A02 = -6.251019177058819709103e-01 */ + .quad 0x3FBA0B830ED567FE /* A03 = +1.017381583418739132707e-01 */ + .quad 0xBFB5BBFE8AC90496 /* A00 = -8.489981544791400103200e-02 */ + .quad 0x3FF612BA70107E95 /* A01 = +1.379572332145390989311e+00 */ + .quad 0xBFE477EAF1FA7693 /* A02 = -6.396383978023599814478e-01 */ + .quad 0x3FBB4784B7C08A95 /* A03 = +1.065600346196709652391e-01 */ + .quad 0xBFB6D5D940743939 /* A00 = -8.920057128509463473254e-02 */ + .quad 0x3FF644A8748F70CE /* A01 = +1.391762214006166953340e+00 */ + .quad 0xBFE4D646AB07EA37 /* A02 = -6.511567440459832267763e-01 */ + .quad 0x3FBC354F4E1D5292 /* A03 = +1.101884427747086558913e-01 */ + .quad 0xBFB7223D19E4F3D1 /* A00 = -9.036619074045339206069e-02 */ + .quad 0x3FF6518FEB42B7FA /* A01 = +1.394912642466350494175e+00 */ + .quad 0xBFE4ED86CB87498C /* A02 = -6.539949393430091184598e-01 */ + .quad 0x3FBC6D29F28CCA9B /* A03 = +1.110407082713131127205e-01 */ + .quad 0xBFB6878652FF6312 /* A00 = -8.800544287022329936754e-02 */ + .quad 0x3FF63948C302D040 /* A01 = +1.388985406648330922508e+00 */ + .quad 0xBFE4C4E2E7904E17 /* A02 = 
-6.490339777687407218920e-01 */ + .quad 0x3FBC127356CA1ABE /* A03 = +1.096565329445224612481e-01 */ + .quad 0xBFB4F5D18B0C91D6 /* A00 = -8.187589306596207427980e-02 */ + .quad 0x3FF5FD27EB7DD0B8 /* A01 = +1.374305648697413673176e+00 */ + .quad 0xBFE464E01A2B2FC6 /* A02 = -6.373138915164353601739e-01 */ + .quad 0x3FBB460547674A30 /* A03 = +1.065371798825160976065e-01 */ + .quad 0xBFB26642FA16A685 /* A00 = -7.187288861919156890412e-02 */ + .quad 0x3FF59F9BEDE1C95A /* A01 = +1.351467065073470141812e+00 */ + .quad 0xBFE3D67920C8FBEA /* A02 = -6.199308052381387046381e-01 */ + .quad 0x3FBA24F6A8D3CBC1 /* A03 = +1.021265184570401413078e-01 */ + .quad 0xBFADB5294794F097 /* A00 = -5.802277563859197656582e-02 */ + .quad 0x3FF523EA7B9CF453 /* A01 = +1.321268542159732772845e+00 */ + .quad 0xBFE322A8B55E35DB /* A02 = -5.979808370918208160205e-01 */ + .quad 0x3FB8C8673B1B3E37 /* A03 = +9.680791085269722928697e-02 */ + .quad 0xBFA4B7D661965C6A /* A00 = -4.046506825687219699450e-02 */ + .quad 0x3FF48DE3E2CE3122 /* A01 = +1.284641157110919085227e+00 */ + .quad 0xBFE251FED1A7F445 /* A02 = -5.725092024655472622285e-01 */ + .quad 0x3FB745699FCABDB9 /* A03 = +9.090290213747821701507e-02 */ + .quad 0xBF93E60456E4EE1D /* A00 = -1.943213253365004902773e-02 */ + .quad 0x3FF3E1A14E628A59 /* A01 = +1.242585474196536532432e+00 */ + .quad 0xBFE16C5AB660E876 /* A02 = -5.444768488007543094653e-01 */ + .quad 0x3FB5AD33AA8C188F /* A03 = +8.467410005332197397987e-02 */ + .quad 0x3F738C17C47C7961 /* A00 = +4.772274820224659853951e-03 */ + .quad 0x3FF3234DDE3BD146 /* A01 = +1.196119182682268355933e+00 */ + .quad 0xBFE078C0D77A9D3B /* A02 = -5.147403915952176722826e-01 */ + .quad 0x3FB40D74B3E276B8 /* A03 = +7.833032027925923568290e-02 */ + .quad 0x3FA0474BECC689C7 /* A00 = +3.179394975019849550746e-02 */ + .quad 0x3FF256FB4FA7D18A /* A01 = +1.146235762743432307076e+00 */ + .quad 0xBFDEFA8E3FB285E2 /* A02 = -4.840427038235174395098e-01 */ + .quad 0x3FB270C007493D59 /* A03 = 
+7.203293016322244446403e-02 */ + .quad 0x3FAF5BD51E479BDC /* A00 = +6.124750132203590768931e-02 */ + .quad 0x3FF18081D0B53BC5 /* A01 = +1.093873801484492647162e+00 */ + .quad 0xBFDCFE2439BD0C03 /* A02 = -4.530115665294831006626e-01 */ + .quad 0x3FB0DEFE5A45AFDD /* A03 = +6.590261176978580437424e-02 */ + .quad 0x3FB7BD5D2806EA26 /* A00 = +9.273321368429118805032e-02 */ + .quad 0x3FF0A369E35B4440 /* A01 = +1.039895904647224256223e+00 */ + .quad 0xBFDB04BC5C9951E7 /* A02 = -4.221640495573226181669e-01 */ + .quad 0x3FAEBBBAA9D6DEEF /* A03 = +6.002600978120919278380e-02 */ + .quad 0x3FC01BE411098DBC /* A00 = +1.258511622610124502941e-01 */ + .quad 0x3FEF85BDABC031C1 /* A01 = +9.850757936961188621083e-01 */ + .quad 0xBFD91521375097C2 /* A02 = -3.919146576102968682065e-01 */ + .quad 0x3FABE26F0086D982 /* A03 = +5.446192628317005068883e-02 */ + .quad 0x3FC481D7FF5776B9 /* A00 = +1.602125164781023347604e-01 */ + .quad 0x3FEDC3506C1E7218 /* A01 = +9.300920592973538347792e-01 */ + .quad 0xBFD7349A88DA7D4F /* A02 = -3.625856720409119104964e-01 */ + .quad 0x3FA936E2DFF8E2AE /* A03 = +4.924687370334389358018e-02 */ + .quad 0x3FC90471F96FA27A /* A00 = +1.954481571149420671141e-01 */ + .quad 0x3FEC0451601987A2 /* A01 = +8.755270840595026360376e-01 */ + .quad 0xBFD5671CD4B898DC /* A02 = -3.344184949259110251063e-01 */ + .quad 0x3FA6BB9594603B67 /* A03 = +4.439990459660841243261e-02 */ + .quad 0x3FCFD8ADB9ED944C /* A00 = +2.488000066615846384011e-01 */ + .quad 0x3FE978C073F6809A /* A01 = +7.959902062321078108909e-01 */ + .quad 0xBFD2DF7E00BCD5A9 /* A02 = -2.948908812716931060471e-01 */ + .quad 0x3FA3614033D490B2 /* A03 = +3.785133965200894456959e-02 */ + .quad 0x3FD4846A12AFE5A0 /* A00 = +3.205819303981005674586e-01 */ + .quad 0x3FE63A1147D40472 /* A01 = +6.945883181471244061100e-01 */ + .quad 0xBFCFA2268AD34450 /* A02 = -2.471359422548027318101e-01 */ + .quad 0x3F9F150201D9FFE0 /* A03 = +3.035357605267552383310e-02 */ + .quad 0x3FD9018641F82BEB /* A00 = 
+3.907180446846598154131e-01 */ + .quad 0x3FE33B7C220FFBDC /* A01 = +6.010113396913498995389e-01 */ + .quad 0xBFCA4E4187E29C86 /* A02 = -2.055131829740483584423e-01 */ + .quad 0x3F98C30CED19F8F4 /* A03 = +2.418155858185229434287e-02 */ + .quad 0x3FDD4B8255BEB078 /* A00 = +4.577337109901757905561e-01 */ + .quad 0x3FE0858B19D3A49B /* A01 = +5.163016800335243905451e-01 */ + .quad 0xBFC5BC929EACE564 /* A02 = -1.698172831327539045176e-01 */ + .quad 0x3F93A083CE57DE2B /* A03 = +1.916700312537337677621e-02 */ + .quad 0x3FE0A8E5E039295C /* A00 = +5.206174258576470315063e-01 */ + .quad 0x3FDC35E1234583FE /* A01 = +4.407885403107342225937e-01 */ + .quad 0xBFC1DE034E31AEB9 /* A02 = -1.395877963835710222629e-01 */ + .quad 0x3F8EFDEBB3471BDC /* A03 = +1.513275280821162888101e-02 */ + .quad 0x3FE2851B603CB2A5 /* A00 = +5.787484054213406503564e-01 */ + .quad 0x3FD7F4A44ABBB286 /* A01 = +3.743067483726821853551e-01 */ + .quad 0xBFBD3EEB67087DE7 /* A02 = -1.142413260026767657385e-01 */ + .quad 0x3F8864F38329E8BD /* A03 = +1.191129917173260922836e-02 */ + .quad 0x3FE437DBE3C34AC1 /* A00 = +6.318187187665317283702e-01 */ + .quad 0x3FD43F6F789441B5 /* A01 = +3.163717916040938438194e-01 */ + .quad 0xBFB7D92E7901B9A4 /* A02 = -9.315767721429907277653e-02 */ + .quad 0x3F8327ED342308E1 /* A03 = +9.353497651663324544136e-03 */ + .quad 0x3FE5C0977766D55C /* A00 = +6.797597248138731451661e-01 */ + .quad 0x3FD10B42A764D8F9 /* A01 = +2.663122782427219115142e-01 */ + .quad 0xBFB3633351D3D70F /* A02 = -7.573242900602060456716e-02 */ + .quad 0x3F7E079E30FF899C /* A03 = +7.331483779099558922843e-03 */ + .quad 0x3FE7202CE08A88C4 /* A00 = +7.226776490754436288455e-01 */ + .quad 0x3FCC973EB5662B01 /* A01 = +2.233656297433626314319e-01 */ + .quad 0xBFAF70A455F9920B /* A02 = -6.140626477716545211782e-02 */ + .quad 0x3F77812411CE99B6 /* A03 = +5.738392731393584730859e-03 */ + .quad 0x3FE85879424095B1 /* A00 = +7.608000082006382003286e-01 */ + .quad 0x3FC7E73BD1674D84 /* A01 = 
+1.867441914060742336190e-01 */ + .quad 0xBFA96F84E4BF333B /* A02 = -4.967894832916504993525e-02 */ + .quad 0x3F72606DDCA6E117 /* A03 = +4.486493251924870105662e-03 */ + .quad 0x3FE96BFE4957F4DD /* A00 = +7.944327766887472330737e-01 */ + .quad 0x3FC3ED4780D25478 /* A01 = +1.556786898624158421711e-01 */ + .quad 0xBFA489C5F9A56B58 /* A02 = -4.011362717093075458408e-02 */ + .quad 0x3F6CB5DC17E9AD2A /* A03 = +3.504686231556104931972e-03 */ + .quad 0x3FEA5D9CB2F41234 /* A00 = +8.239272589858672724006e-01 */ + .quad 0x3FC091A758374DCF /* A01 = +1.294449978582705440555e-01 */ + .quad 0xBFA08E436D4B5CE0 /* A02 = -3.233538350257858517978e-02 */ + .quad 0x3F666997AD53E6B7 /* A03 = +2.735897297154145629133e-03 */ + .quad 0x3FEB3060342CB850 /* A00 = +8.496552485501158713532e-01 */ + .quad 0x3FBB7D30BBC7DC1B /* A01 = +1.073790033768634993860e-01 */ + .quad 0xBF9AA6BA3443D9E3 /* A02 = -2.602663940430173170060e-02 */ + .quad 0x3F617CA764B7850B /* A03 = +2.134634914668814050648e-03 */ + .quad 0x3FEBE759A6A0C7B8 /* A00 = +8.719909910635044170135e-01 */ + .quad 0x3FB6C10DE6A703FF /* A01 = +8.888327485239243264115e-02 */ + .quad 0xBF956C566D8BE1F6 /* A02 = -2.092108768099084498138e-02 */ + .quad 0x3F5B46D1A4A59CF8 /* A03 = +1.664833764687232917079e-03 */ + .quad 0x3FEC858494887A04 /* A00 = +8.912985707318630268503e-01 */ + .quad 0x3FB2CC31F543394D /* A01 = +7.342827070099140762682e-02 */ + .quad 0xBF9133477FF69137 /* A02 = -1.679717749142747504343e-02 */ + .quad 0x3F5544482FBB4DA5 /* A03 = +1.298017973501022466823e-03 */ + .quad 0x3FED0DB59D0E32E9 /* A00 = +9.079235141267335551518e-01 */ + .quad 0x3FAF006BAFFC6EF4 /* A01 = +6.055008433597022787787e-02 */ + .quad 0xBF8B97146FA2B97A /* A02 = -1.347175565419144252499e-02 */ + .quad 0x3F5093B01F4CDC69 /* A03 = +1.011774057770665211434e-03 */ + .quad 0x3FEDB487C3EC457C /* A00 = +9.282873942012623835751e-01 */ + .quad 0x3FA7390C09D0BD1D /* A01 = +4.535710925881118044112e-02 */ + .quad 0xBF83D9F7C3181106 /* A02 = 
-9.693084374710735778846e-03 */ + .quad 0x3F46E34A0A3C0E64 /* A03 = +6.984817050299072134500e-04 */ + .quad 0x3FEE5FFCB4E6EB00 /* A00 = +9.492171796076434020506e-01 */ + .quad 0x3F9F4913ED00AADF /* A01 = +3.055220731782070861526e-02 */ + .quad 0xBF79670BD0E59B5C /* A02 = -6.201788097633133961528e-03 */ + .quad 0x3F3BC998EBCAF96D /* A03 = +4.240034429975534616304e-04 */ + .quad 0x3FEEDBA41E9542FE /* A00 = +9.643116566968215064293e-01 */ + .quad 0x3F94F5DD18D9C24D /* A01 = +2.046914543319848858727e-02 */ + .quad 0xBF7034896AA122B9 /* A02 = -3.956352980886528904192e-03 */ + .quad 0x3F30DCCB47810B39 /* A03 = +2.573009765038273091199e-04 */ + .quad 0x3FEF33F2882520ED /* A00 = +9.750912341196716903724e-01 */ + .quad 0x3F8BF37F2CF553FF /* A01 = +1.364802699996836392315e-02 */ + .quad 0xBF649F6F05A69619 /* A02 = -2.517430152880317534986e-03 */ + .quad 0x3F247623C950AAC9 /* A03 = +1.561087307505231250044e-04 */ + .quad 0x3FEF727757751741 /* A00 = +9.827229221489021115943e-01 */ + .quad 0x3F828E67912C4400 /* A01 = +9.060677640748693306705e-03 */ + .quad 0xBF5A2F51A806CC2C /* A02 = -1.598195784123355826789e-03 */ + .quad 0x3F18D35D7687E613 /* A03 = +9.470231965016282719549e-05 */ + .quad 0x3FEF9E6325C5942A /* A00 = +9.880843866091073568469e-01 */ + .quad 0x3F788AB117618F76 /* A01 = +5.991641772286606867914e-03 */ + .quad 0xBF5096EAB0B1EA89 /* A02 = -1.012543859160305046233e-03 */ + .quad 0x3F0E1E50EC4435AB /* A03 = +5.744633156910412119652e-05 */ + .quad 0x3FEFBD0784049369 /* A00 = +9.918248728250605994461e-01 */ + .quad 0x3F702BBD8294035F /* A01 = +3.947963975634432264028e-03 */ + .quad 0xBF44FB55E0F00593 /* A02 = -6.403130845457509273330e-04 */ + .quad 0x3F0244DCD723230A /* A03 = +3.484534217219031730379e-05 */ + .quad 0x3FEFD245E2366A43 /* A00 = +9.944180887426415926811e-01 */ + .quad 0x3F653D82EC088433 /* A01 = +2.592807490387838333795e-03 */ + .quad 0xBF3A7DF75E013CB8 /* A02 = -4.042366908878036561859e-04 */ + .quad 0x3EF6298E69F991CD /* A03 = 
+2.113564425911141559972e-05 */ + .quad 0x3FEFE0EAA508BC69 /* A00 = +9.962056372950317539861e-01 */ + .quad 0x3F5BD0771AF3FDDA /* A01 = +1.697651208644282514598e-03 */ + .quad 0xBF30B2E1254DE571 /* A02 = -2.548026725928887099328e-04 */ + .quad 0x3EEAE28B70EC0256 /* A03 = +1.281973848454955042307e-05 */ + .quad 0x3FEFEAF5303D7F96 /* A00 = +9.974313680831865536192e-01 */ + .quad 0x3F5229111365657E /* A01 = +1.108423877289460134782e-03 */ + .quad 0xBF250572D04DFE66 /* A02 = -1.603796628408704519168e-04 */ + .quad 0x3EE04E89BB57C981 /* A03 = +7.775682983689149966743e-06 */ + .quad 0x3FEFF1CF52F1CF44 /* A00 = +9.982678051005469122003e-01 */ + .quad 0x3F47A71316147CEB /* A01 = +7.218211359577819110842e-04 */ + .quad 0xBF1A6D7604055719 /* A02 = -1.008132248946049582547e-04 */ + .quad 0x3ED3C8047586A85C /* A03 = +4.716233739913014633626e-06 */ + .quad 0x3FEFF6770369EF69 /* A00 = +9.988360468555416149528e-01 */ + .quad 0x3F3EBB261180FBF0 /* A01 = +4.689186039321105101130e-04 */ + .quad 0xBF1097754FE19D7F /* A02 = -6.329206004950480057066e-05 */ + .quad 0x3EC7FEFF83BCA0A7 /* A03 = +2.860556404988488738366e-06 */ + .quad 0x3FEFF99D42371AC4 /* A00 = +9.992204945818561334647e-01 */ + .quad 0x3F33EB2AEC271F59 /* A01 = +3.039340773764907474054e-04 */ + .quad 0xBF04CF18E0FC0D79 /* A02 = -3.968996690952969588805e-05 */ + .quad 0x3EBD1BDBD6019BE9 /* A03 = +1.735021065507727833886e-06 */ + .quad 0x3FEFFBBCA32B0D91 /* A00 = +9.994795977476532700123e-01 */ + .quad 0x3F29C41E1615110A /* A01 = +1.965796209707565346710e-04 */ + .quad 0xBEFA11F93D9DCB5A /* A02 = -2.486248909101414873235e-05 */ + .quad 0x3EB1A7CA4546F7A7 /* A03 = +1.052345642723709228769e-06 */ + .quad 0x3FEFFD298B8E8DE2 /* A00 = +9.996535993308806045121e-01 */ + .quad 0x3F20A1C42D523C5B /* A01 = +1.268913244172078754520e-04 */ + .quad 0xBEF0507A364AFAE4 /* A02 = -1.555859070622834605755e-05 */ + .quad 0x3EA56ACA17E7CDF4 /* A03 = +6.382806956848098872313e-07 */ + .quad 0x3FEFFE1DC82BA5A3 /* A00 = 
+9.997700604991915929176e-01 */ + .quad 0x3F156E73B90F1769 /* A01 = +8.175450626798714452801e-05 */ + .quad 0xBEE4663579D0A09F /* A02 = -9.727122057226747625365e-06 */ + .quad 0x3E99FAF6FEC5D4C1 /* A03 = +3.871371052824002996020e-07 */ + .quad 0x3FEFFEF8D0BB5E81 /* A00 = +9.998745037837154514548e-01 */ + .quad 0x3F06686DA18D39C3 /* A01 = +4.273972098777251447726e-05 */ + .quad 0xBED46BC298073E90 /* A02 = -4.868731025855742842491e-06 */ + .quad 0x3E88E42286B9D0FD /* A03 = +1.854535328530838170114e-07 */ + .quad 0x3FEFFF8DBC68DDC7 /* A00 = +9.999455146670975791423e-01 */ + .quad 0x3EF26B2953A80AF0 /* A01 = +1.756534514108903368909e-05 */ + .quad 0xBEBFC4472D580F83 /* A02 = -1.893443529411295465239e-06 */ + .quad 0x3E72505B4553D19F /* A03 = +6.822456673547912277047e-08 */ + .quad 0x3FEFFFCED1276609 /* A00 = +9.999765477215883935358e-01 */ + .quad 0x3EDE1A94C7CC58F5 /* A01 = +7.177313020153979672606e-06 */ + .quad 0xBEA8A2C988744E57 /* A02 = -7.342066660497443762363e-07 */ + .quad 0x3E5AF30036BBBAF4 /* A03 = +2.509841882843541084885e-08 */ + .quad 0x3FEFFFEAFE70FCFC /* A00 = +9.999899835164849370983e-01 */ + .quad 0x3EC879175E3549F5 /* A01 = +2.917410471128503564412e-06 */ + .quad 0xBE930E36677D1813 /* A02 = -2.839493400307523115929e-07 */ + .quad 0x3E43D4005B42D48F /* A03 = +9.233192745401904898013e-09 */ + .quad 0x3ff0000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .quad 0x0000000000000000 + .align 32 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 /* _sSignMask */ + .align 32 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff /* _sAbsMask */ + .align 32 + .long 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000, 0x7ff80000 /* _iExpMantMask */ + .align 32 + .long 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000, 0x7f000000 /* _iExpMask */ + .align 32 + .long 0x3cf80000, 
0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000, 0x3cf80000 /* _iMinIdxOfsMask */
+	.align	32
+	.long	0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000, 0x04280000 /* _iMaxIdxMask */
+	.align	32
+	.type	__svml_stanh_data_internal,@object
+	.size	__svml_stanh_data_internal,.-__svml_stanh_data_internal
diff --git a/sysdeps/x86_64/fpu/svml_d_tanh2_core.S b/sysdeps/x86_64/fpu/svml_d_tanh2_core.S
new file mode 100644
index 0000000000..c703131777
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_tanh2_core.S
@@ -0,0 +1,29 @@
+/* Function tanh vectorized with SSE2.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVbN2v_tanh)
+WRAPPER_IMPL_SSE2 tanh
+END (_ZGVbN2v_tanh)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVbN2v_tanh)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_tanh4_core.S b/sysdeps/x86_64/fpu/svml_d_tanh4_core.S
new file mode 100644
index 0000000000..fb293f4dba
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_tanh4_core.S
@@ -0,0 +1,29 @@
+/* Function tanh vectorized with AVX2, wrapper version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVdN4v_tanh)
+WRAPPER_IMPL_AVX _ZGVbN2v_tanh
+END (_ZGVdN4v_tanh)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVdN4v_tanh)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_d_tanh4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_tanh4_core_avx.S
new file mode 100644
index 0000000000..5385a2c27c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_tanh4_core_avx.S
@@ -0,0 +1,25 @@
+/* Function tanh vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVcN4v_tanh)
+WRAPPER_IMPL_AVX _ZGVbN2v_tanh
+END (_ZGVcN4v_tanh)
diff --git a/sysdeps/x86_64/fpu/svml_d_tanh8_core.S b/sysdeps/x86_64/fpu/svml_d_tanh8_core.S
new file mode 100644
index 0000000000..9dafa7bb9a
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_d_tanh8_core.S
@@ -0,0 +1,25 @@
+/* Function tanh vectorized with AVX-512, wrapper to AVX2.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_d_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVeN8v_tanh)
+WRAPPER_IMPL_AVX512 _ZGVdN4v_tanh
+END (_ZGVeN8v_tanh)
diff --git a/sysdeps/x86_64/fpu/svml_s_tanhf16_core.S b/sysdeps/x86_64/fpu/svml_s_tanhf16_core.S
new file mode 100644
index 0000000000..19d51365e8
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_tanhf16_core.S
@@ -0,0 +1,25 @@
+/* Function tanhf vectorized with AVX-512. Wrapper to AVX2 version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVeN16v_tanhf)
+WRAPPER_IMPL_AVX512 _ZGVdN8v_tanhf
+END (_ZGVeN16v_tanhf)
diff --git a/sysdeps/x86_64/fpu/svml_s_tanhf4_core.S b/sysdeps/x86_64/fpu/svml_s_tanhf4_core.S
new file mode 100644
index 0000000000..6b98950f84
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_tanhf4_core.S
@@ -0,0 +1,29 @@
+/* Function tanhf vectorized with SSE2, wrapper version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVbN4v_tanhf)
+WRAPPER_IMPL_SSE2 tanhf
+END (_ZGVbN4v_tanhf)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVbN4v_tanhf)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_tanhf8_core.S b/sysdeps/x86_64/fpu/svml_s_tanhf8_core.S
new file mode 100644
index 0000000000..3ada061ae0
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_tanhf8_core.S
@@ -0,0 +1,29 @@
+/* Function tanhf vectorized with AVX2, wrapper version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVdN8v_tanhf)
+WRAPPER_IMPL_AVX _ZGVbN4v_tanhf
+END (_ZGVdN8v_tanhf)
+
+#ifndef USE_MULTIARCH
+libmvec_hidden_def (_ZGVdN8v_tanhf)
+#endif
diff --git a/sysdeps/x86_64/fpu/svml_s_tanhf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_tanhf8_core_avx.S
new file mode 100644
index 0000000000..255d45952d
--- /dev/null
+++ b/sysdeps/x86_64/fpu/svml_s_tanhf8_core_avx.S
@@ -0,0 +1,25 @@
+/* Function tanhf vectorized in AVX ISA as wrapper to SSE4 ISA version.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include "svml_s_wrapper_impl.h"
+
+	.text
+ENTRY (_ZGVcN8v_tanhf)
+WRAPPER_IMPL_AVX _ZGVbN4v_tanhf
+END (_ZGVcN8v_tanhf)
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx.c
new file mode 100644
index 0000000000..a456c574e2
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-tanh.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx2.c
new file mode 100644
index 0000000000..a456c574e2
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx2.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-tanh.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx512f.c
new file mode 100644
index 0000000000..a456c574e2
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-tanh-avx512f.c
@@ -0,0 +1 @@
+#include "test-double-libmvec-tanh.c"
diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-tanh.c b/sysdeps/x86_64/fpu/test-double-libmvec-tanh.c
new file mode 100644
index 0000000000..4cb6a169d8
--- /dev/null
+++ b/sysdeps/x86_64/fpu/test-double-libmvec-tanh.c
@@ -0,0 +1,3 @@
+#define LIBMVEC_TYPE double
+#define LIBMVEC_FUNC tanh
+#include "test-vector-abi-arg1.h"
diff
--git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index 9d91ccfe51..f53bb6813e 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVbN2v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVbN2v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVbN2v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVbN2v_erf) +VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVbN2v_tanh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 9e86d5fef8..0452c3db38 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -47,6 +47,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVdN4v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVdN4v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVdN4v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVdN4v_erf) +VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVdN4v_tanh) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 0f4ef00de4..197d5afc88 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVcN4v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVcN4v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVcN4v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVcN4v_erf) +VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVcN4v_tanh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index 975dff85af..e56ece640c 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -44,6 
+44,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1p), _ZGVeN8v_log1p) VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVeN8v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVeN8v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVeN8v_erf) +VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVeN8v_tanh) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx.c new file mode 100644 index 0000000000..254f9201aa --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-tanhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx2.c new file mode 100644 index 0000000000..254f9201aa --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-tanhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx512f.c new file mode 100644 index 0000000000..254f9201aa --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-tanhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-tanhf.c b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf.c new file mode 100644 index 0000000000..9a61ee8f9c --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-tanhf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC tanhf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index 2b1e27391a..abbebf9993 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVeN16v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVeN16v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVeN16v_acoshf) VECTOR_WRAPPER 
(WRAPPER_NAME (erff), _ZGVeN16v_erff) +VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVeN16v_tanhf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index 78428bf517..ae1c8b98c2 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVbN4v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVbN4v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVbN4v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVbN4v_erff) +VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVbN4v_tanhf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index dadd4e6ca0..eb477a0371 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -47,6 +47,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVdN8v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVdN8v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVdN8v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVdN8v_erff) +VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVdN8v_tanhf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. 
*/ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 7b2d583e54..944f7f0a75 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -44,6 +44,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (log1pf), _ZGVcN8v_log1pf) VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVcN8v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVcN8v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVcN8v_erff) +VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVcN8v_tanhf) #define VEC_INT_TYPE __m128i
From patchwork Tue Dec 28 20:11:30 2021
From: Sunil Pandey
To: libc-alpha@sourceware.org
Cc: andrey.kolesov@intel.com, marius.cornea@intel.com
Subject: [PATCH v4 18/18] x86-64: Add vector asinh/asinhf implementation to libmvec
Date: Tue, 28 Dec 2021 12:11:30 -0800
Message-Id: <20211228201130.737370-19-skpgkp2@gmail.com>
In-Reply-To: <20211228201130.737370-1-skpgkp2@gmail.com>
References: <20211228201130.737370-1-skpgkp2@gmail.com>

Implement vectorized asinh/asinhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector asinh/asinhf with regenerated ulps.
--- bits/libm-simd-decl-stubs.h | 11 + math/bits/mathcalls.h | 2 +- .../unix/sysv/linux/x86_64/libmvec.abilist | 8 + sysdeps/x86/fpu/bits/math-vector.h | 4 + .../x86/fpu/finclude/math-vector-fortran.h | 4 + sysdeps/x86_64/fpu/Makeconfig | 1 + sysdeps/x86_64/fpu/Versions | 2 + sysdeps/x86_64/fpu/libm-test-ulps | 17 + .../fpu/multiarch/svml_d_asinh2_core-sse2.S | 20 + .../x86_64/fpu/multiarch/svml_d_asinh2_core.c | 27 + .../fpu/multiarch/svml_d_asinh2_core_sse4.S | 1659 +++++++++++++++++ .../fpu/multiarch/svml_d_asinh4_core-sse.S | 20 + .../x86_64/fpu/multiarch/svml_d_asinh4_core.c | 27 + .../fpu/multiarch/svml_d_asinh4_core_avx2.S | 1598 ++++++++++++++++ .../fpu/multiarch/svml_d_asinh8_core-avx2.S | 20 + .../x86_64/fpu/multiarch/svml_d_asinh8_core.c | 27 + .../fpu/multiarch/svml_d_asinh8_core_avx512.S | 510 +++++ .../fpu/multiarch/svml_s_asinhf16_core-avx2.S | 20 + .../fpu/multiarch/svml_s_asinhf16_core.c | 28 + .../multiarch/svml_s_asinhf16_core_avx512.S | 476 +++++ .../fpu/multiarch/svml_s_asinhf4_core-sse2.S | 20 + .../fpu/multiarch/svml_s_asinhf4_core.c | 28 + .../fpu/multiarch/svml_s_asinhf4_core_sse4.S | 509 +++++ .../fpu/multiarch/svml_s_asinhf8_core-sse.S | 20 +
.../fpu/multiarch/svml_s_asinhf8_core.c | 28 + .../fpu/multiarch/svml_s_asinhf8_core_avx2.S | 457 +++++ sysdeps/x86_64/fpu/svml_d_asinh2_core.S | 29 + sysdeps/x86_64/fpu/svml_d_asinh4_core.S | 29 + sysdeps/x86_64/fpu/svml_d_asinh4_core_avx.S | 25 + sysdeps/x86_64/fpu/svml_d_asinh8_core.S | 25 + sysdeps/x86_64/fpu/svml_s_asinhf16_core.S | 25 + sysdeps/x86_64/fpu/svml_s_asinhf4_core.S | 29 + sysdeps/x86_64/fpu/svml_s_asinhf8_core.S | 29 + sysdeps/x86_64/fpu/svml_s_asinhf8_core_avx.S | 25 + .../fpu/test-double-libmvec-asinh-avx.c | 1 + .../fpu/test-double-libmvec-asinh-avx2.c | 1 + .../fpu/test-double-libmvec-asinh-avx512f.c | 1 + .../x86_64/fpu/test-double-libmvec-asinh.c | 3 + .../x86_64/fpu/test-double-vlen2-wrappers.c | 1 + .../fpu/test-double-vlen4-avx2-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen4-wrappers.c | 1 + .../x86_64/fpu/test-double-vlen8-wrappers.c | 1 + .../fpu/test-float-libmvec-asinhf-avx.c | 1 + .../fpu/test-float-libmvec-asinhf-avx2.c | 1 + .../fpu/test-float-libmvec-asinhf-avx512f.c | 1 + .../x86_64/fpu/test-float-libmvec-asinhf.c | 3 + .../x86_64/fpu/test-float-vlen16-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen4-wrappers.c | 1 + .../fpu/test-float-vlen8-avx2-wrappers.c | 1 + .../x86_64/fpu/test-float-vlen8-wrappers.c | 1 + 50 files changed, 5778 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core_avx512.S create mode 100644 
sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core-avx2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core_avx512.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core-sse2.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core_sse4.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core-sse.S create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core.c create mode 100644 sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core_avx2.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asinh2_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asinh4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asinh4_core_avx.S create mode 100644 sysdeps/x86_64/fpu/svml_d_asinh8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinhf16_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinhf4_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinhf8_core.S create mode 100644 sysdeps/x86_64/fpu/svml_s_asinhf8_core_avx.S create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-double-libmvec-asinh.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx2.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx512f.c create mode 100644 sysdeps/x86_64/fpu/test-float-libmvec-asinhf.c diff --git a/bits/libm-simd-decl-stubs.h b/bits/libm-simd-decl-stubs.h index 21f1a43232..bcaddb7a0e 100644 --- a/bits/libm-simd-decl-stubs.h +++ b/bits/libm-simd-decl-stubs.h @@ -296,4 +296,15 @@ #define __DECL_SIMD_tanhf32x #define __DECL_SIMD_tanhf64x #define __DECL_SIMD_tanhf128x + +#define 
__DECL_SIMD_asinh +#define __DECL_SIMD_asinhf +#define __DECL_SIMD_asinhl +#define __DECL_SIMD_asinhf16 +#define __DECL_SIMD_asinhf32 +#define __DECL_SIMD_asinhf64 +#define __DECL_SIMD_asinhf128 +#define __DECL_SIMD_asinhf32x +#define __DECL_SIMD_asinhf64x +#define __DECL_SIMD_asinhf128x #endif diff --git a/math/bits/mathcalls.h b/math/bits/mathcalls.h index 3d1c2056d5..40e055e579 100644 --- a/math/bits/mathcalls.h +++ b/math/bits/mathcalls.h @@ -84,7 +84,7 @@ __MATHDECL_VEC (void,sincos,, /* Hyperbolic arc cosine of X. */ __MATHCALL_VEC (acosh,, (_Mdouble_ __x)); /* Hyperbolic arc sine of X. */ -__MATHCALL (asinh,, (_Mdouble_ __x)); +__MATHCALL_VEC (asinh,, (_Mdouble_ __x)); /* Hyperbolic arc tangent of X. */ __MATHCALL_VEC (atanh,, (_Mdouble_ __x)); #endif diff --git a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist index e178cef683..df265d6a12 100644 --- a/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/libmvec.abilist @@ -49,6 +49,7 @@ GLIBC_2.22 _ZGVeN8vvv_sincos F GLIBC_2.35 _ZGVbN2v_acos F GLIBC_2.35 _ZGVbN2v_acosh F GLIBC_2.35 _ZGVbN2v_asin F +GLIBC_2.35 _ZGVbN2v_asinh F GLIBC_2.35 _ZGVbN2v_atan F GLIBC_2.35 _ZGVbN2v_atanh F GLIBC_2.35 _ZGVbN2v_cbrt F @@ -67,6 +68,7 @@ GLIBC_2.35 _ZGVbN2vv_hypot F GLIBC_2.35 _ZGVbN4v_acosf F GLIBC_2.35 _ZGVbN4v_acoshf F GLIBC_2.35 _ZGVbN4v_asinf F +GLIBC_2.35 _ZGVbN4v_asinhf F GLIBC_2.35 _ZGVbN4v_atanf F GLIBC_2.35 _ZGVbN4v_atanhf F GLIBC_2.35 _ZGVbN4v_cbrtf F @@ -85,6 +87,7 @@ GLIBC_2.35 _ZGVbN4vv_hypotf F GLIBC_2.35 _ZGVcN4v_acos F GLIBC_2.35 _ZGVcN4v_acosh F GLIBC_2.35 _ZGVcN4v_asin F +GLIBC_2.35 _ZGVcN4v_asinh F GLIBC_2.35 _ZGVcN4v_atan F GLIBC_2.35 _ZGVcN4v_atanh F GLIBC_2.35 _ZGVcN4v_cbrt F @@ -103,6 +106,7 @@ GLIBC_2.35 _ZGVcN4vv_hypot F GLIBC_2.35 _ZGVcN8v_acosf F GLIBC_2.35 _ZGVcN8v_acoshf F GLIBC_2.35 _ZGVcN8v_asinf F +GLIBC_2.35 _ZGVcN8v_asinhf F GLIBC_2.35 _ZGVcN8v_atanf F GLIBC_2.35 _ZGVcN8v_atanhf F GLIBC_2.35 _ZGVcN8v_cbrtf F 
@@ -121,6 +125,7 @@ GLIBC_2.35 _ZGVcN8vv_hypotf F GLIBC_2.35 _ZGVdN4v_acos F GLIBC_2.35 _ZGVdN4v_acosh F GLIBC_2.35 _ZGVdN4v_asin F +GLIBC_2.35 _ZGVdN4v_asinh F GLIBC_2.35 _ZGVdN4v_atan F GLIBC_2.35 _ZGVdN4v_atanh F GLIBC_2.35 _ZGVdN4v_cbrt F @@ -139,6 +144,7 @@ GLIBC_2.35 _ZGVdN4vv_hypot F GLIBC_2.35 _ZGVdN8v_acosf F GLIBC_2.35 _ZGVdN8v_acoshf F GLIBC_2.35 _ZGVdN8v_asinf F +GLIBC_2.35 _ZGVdN8v_asinhf F GLIBC_2.35 _ZGVdN8v_atanf F GLIBC_2.35 _ZGVdN8v_atanhf F GLIBC_2.35 _ZGVdN8v_cbrtf F @@ -157,6 +163,7 @@ GLIBC_2.35 _ZGVdN8vv_hypotf F GLIBC_2.35 _ZGVeN16v_acosf F GLIBC_2.35 _ZGVeN16v_acoshf F GLIBC_2.35 _ZGVeN16v_asinf F +GLIBC_2.35 _ZGVeN16v_asinhf F GLIBC_2.35 _ZGVeN16v_atanf F GLIBC_2.35 _ZGVeN16v_atanhf F GLIBC_2.35 _ZGVeN16v_cbrtf F @@ -175,6 +182,7 @@ GLIBC_2.35 _ZGVeN16vv_hypotf F GLIBC_2.35 _ZGVeN8v_acos F GLIBC_2.35 _ZGVeN8v_acosh F GLIBC_2.35 _ZGVeN8v_asin F +GLIBC_2.35 _ZGVeN8v_asinh F GLIBC_2.35 _ZGVeN8v_atan F GLIBC_2.35 _ZGVeN8v_atanh F GLIBC_2.35 _ZGVeN8v_cbrt F diff --git a/sysdeps/x86/fpu/bits/math-vector.h b/sysdeps/x86/fpu/bits/math-vector.h index 3c657f6108..71b7d660db 100644 --- a/sysdeps/x86/fpu/bits/math-vector.h +++ b/sysdeps/x86/fpu/bits/math-vector.h @@ -130,6 +130,10 @@ # define __DECL_SIMD_tanh __DECL_SIMD_x86_64 # undef __DECL_SIMD_tanhf # define __DECL_SIMD_tanhf __DECL_SIMD_x86_64 +# undef __DECL_SIMD_asinh +# define __DECL_SIMD_asinh __DECL_SIMD_x86_64 +# undef __DECL_SIMD_asinhf +# define __DECL_SIMD_asinhf __DECL_SIMD_x86_64 # endif #endif diff --git a/sysdeps/x86/fpu/finclude/math-vector-fortran.h b/sysdeps/x86/fpu/finclude/math-vector-fortran.h index c7f81945fe..4d3afdf753 100644 --- a/sysdeps/x86/fpu/finclude/math-vector-fortran.h +++ b/sysdeps/x86/fpu/finclude/math-vector-fortran.h @@ -64,6 +64,8 @@ !GCC$ builtin (erff) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (tanh) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (tanhf) attributes simd (notinbranch) if('x86_64') +!GCC$ builtin (asinh) attributes simd 
(notinbranch) if('x86_64') +!GCC$ builtin (asinhf) attributes simd (notinbranch) if('x86_64') !GCC$ builtin (cos) attributes simd (notinbranch) if('x32') !GCC$ builtin (cosf) attributes simd (notinbranch) if('x32') @@ -113,3 +115,5 @@ !GCC$ builtin (erff) attributes simd (notinbranch) if('x32') !GCC$ builtin (tanh) attributes simd (notinbranch) if('x32') !GCC$ builtin (tanhf) attributes simd (notinbranch) if('x32') +!GCC$ builtin (asinh) attributes simd (notinbranch) if('x32') +!GCC$ builtin (asinhf) attributes simd (notinbranch) if('x32') diff --git a/sysdeps/x86_64/fpu/Makeconfig b/sysdeps/x86_64/fpu/Makeconfig index 26df8d47bf..2ff33c7dd8 100644 --- a/sysdeps/x86_64/fpu/Makeconfig +++ b/sysdeps/x86_64/fpu/Makeconfig @@ -25,6 +25,7 @@ libmvec-funcs = \ acos \ acosh \ asin \ + asinh \ atan \ atan2 \ atanh \ diff --git a/sysdeps/x86_64/fpu/Versions b/sysdeps/x86_64/fpu/Versions index adcbe0fefb..e6ead13085 100644 --- a/sysdeps/x86_64/fpu/Versions +++ b/sysdeps/x86_64/fpu/Versions @@ -17,6 +17,7 @@ libmvec { _ZGVbN2v_acos; _ZGVcN4v_acos; _ZGVdN4v_acos; _ZGVeN8v_acos; _ZGVbN2v_acosh; _ZGVcN4v_acosh; _ZGVdN4v_acosh; _ZGVeN8v_acosh; _ZGVbN2v_asin; _ZGVcN4v_asin; _ZGVdN4v_asin; _ZGVeN8v_asin; + _ZGVbN2v_asinh; _ZGVcN4v_asinh; _ZGVdN4v_asinh; _ZGVeN8v_asinh; _ZGVbN2v_atan; _ZGVcN4v_atan; _ZGVdN4v_atan; _ZGVeN8v_atan; _ZGVbN2v_atanh; _ZGVcN4v_atanh; _ZGVdN4v_atanh; _ZGVeN8v_atanh; _ZGVbN2v_cbrt; _ZGVcN4v_cbrt; _ZGVdN4v_cbrt; _ZGVeN8v_cbrt; @@ -35,6 +36,7 @@ libmvec { _ZGVbN4v_acosf; _ZGVcN8v_acosf; _ZGVdN8v_acosf; _ZGVeN16v_acosf; _ZGVbN4v_acoshf; _ZGVcN8v_acoshf; _ZGVdN8v_acoshf; _ZGVeN16v_acoshf; _ZGVbN4v_asinf; _ZGVcN8v_asinf; _ZGVdN8v_asinf; _ZGVeN16v_asinf; + _ZGVbN4v_asinhf; _ZGVcN8v_asinhf; _ZGVdN8v_asinhf; _ZGVeN16v_asinhf; _ZGVbN4v_atanf; _ZGVcN8v_atanf; _ZGVdN8v_atanf; _ZGVeN16v_atanf; _ZGVbN4v_atanhf; _ZGVcN8v_atanhf; _ZGVdN8v_atanhf; _ZGVeN16v_atanhf; _ZGVbN4v_cbrtf; _ZGVcN8v_cbrtf; _ZGVdN8v_cbrtf; _ZGVeN16v_cbrtf; diff --git 
a/sysdeps/x86_64/fpu/libm-test-ulps b/sysdeps/x86_64/fpu/libm-test-ulps index bfaad7acef..71e9fced02 100644 --- a/sysdeps/x86_64/fpu/libm-test-ulps +++ b/sysdeps/x86_64/fpu/libm-test-ulps @@ -157,6 +157,23 @@ float: 3 float128: 4 ldouble: 5 +Function: "asinh_vlen2": +double: 1 + +Function: "asinh_vlen4": +double: 1 +float: 1 + +Function: "asinh_vlen4_avx2": +double: 1 + +Function: "asinh_vlen8": +double: 1 +float: 1 + +Function: "asinh_vlen8_avx2": +float: 1 + Function: "atan": double: 1 float: 1 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core-sse2.S new file mode 100644 index 0000000000..ddd1c3ca24 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized asinh, vector length is 2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN2v_asinh _ZGVbN2v_asinh_sse2 +#include "../svml_d_asinh2_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core.c new file mode 100644 index 0000000000..37452d0f92 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized asinh, vector length is 2. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN2v_asinh +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN2v_asinh, __GI__ZGVbN2v_asinh, __redirect__ZGVbN2v_asinh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core_sse4.S new file mode 100644 index 0000000000..afe9e04fcd --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh2_core_sse4.S @@ -0,0 +1,1659 @@ +/* Function asinh vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute asinh(x) as log(x + sqrt(x*x + 1)) + * + * Special cases: + * + * asinh(NaN) = quiet NaN, and raise invalid exception + * asinh(INF) = that INF + * asinh(0) = that 0 + * + */ + +/* Offsets for data table __svml_dasinh_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8208 +#define poly_coeff 12320 +#define ExpMask 12384 +#define Two10 12400 +#define MinLog1p 12416 +#define MaxLog1p 12432 +#define One 12448 +#define SgnMask 12464 +#define XThreshold 12480 +#define XhMask 12496 +#define Threshold 12512 +#define Bias 12528 +#define Bias1 12544 +#define ExpMask0 12560 +#define ExpMask2 12576 +#define L2 12592 +#define dBigThreshold 12608 +#define dC2 12624 +#define dC3 12640 +#define dC4 12656 +#define dC5 12672 +#define dHalf 12688 +#define dLargestFinite 12704 +#define dLittleThreshold 12720 +#define dSign 12736 +#define dThirtyOne 12752 +#define dTopMask12 12768 +#define dTopMask26 12784 +#define dTopMask29 12800 +#define XScale 12816 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN2v_asinh_sse4) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $64, %rsp + movaps %xmm0, %xmm13 + +/* + * Split X into high and low parts, XHi (<= 26 bits) and XLo (<= 27 bits) + * We could use either X or |X| here, but it doesn't seem to matter + */ + movups dTopMask26+__svml_dasinh_data_internal(%rip), %xmm15 + movaps %xmm13, %xmm7 + andps %xmm13, %xmm15 + lea -4218864+__svml_dasinh_data_internal(%rip), %rsi + +/* + * Compute X^2 = (XHi + XLo)^2 = XHi^2 + XLo * (X + XHi) + * The two parts are shifted off by around 26 bits. 
So even though + * the low bit will not in general be exact, it's near enough + */ + movaps %xmm15, %xmm8 + mulpd %xmm15, %xmm8 + subpd %xmm15, %xmm7 + addpd %xmm13, %xmm15 + +/* Load the constant 1 and a sign mask */ + movups One+__svml_dasinh_data_internal(%rip), %xmm12 + +/* + * Finally, express Y + W = X^2 + 1 accurately where Y has <= 29 bits. + * If |X| <= 1 then |XHi| <= 1 and so |X2Hi| <= 1, so we can treat 1 + * as the dominant component in the compensated summation. Otherwise, + * if |X| >= 1, then since X2Hi only has 52 significant bits, the basic + * addition will be exact anyway until we get to |X| >= 2^53. But by + * that time the log function is well-conditioned enough that the + * rounding error doesn't matter. Hence we can treat 1 as dominant even + * if it literally isn't. + */ + movaps %xmm12, %xmm3 + movaps %xmm12, %xmm5 + addpd %xmm8, %xmm3 + mulpd %xmm15, %xmm7 + subpd %xmm3, %xmm5 + movups dTopMask29+__svml_dasinh_data_internal(%rip), %xmm6 + andps %xmm3, %xmm6 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 12 significant bits in case it isn't already + * This means that R * Y and R^2 * Y are exactly representable. + */ + cvtpd2ps %xmm6, %xmm1 + addpd %xmm8, %xmm5 + subpd %xmm6, %xmm3 + +/* + * Unfortunately, we can still be in trouble if |X| <= 2^-10, since + * the absolute error 2^-(12+53)-ish in sqrt(1 + X^2) gets scaled up + * by 1/X and comes close to our threshold. 
Hence if |X| <= 2^-9, + * perform an alternative computation + * sqrt(1 + X^2) - 1 = X^2/2 - X^4/8 + X^6/16 + * X2 = X^2 + */ + addpd %xmm7, %xmm8 + addpd %xmm7, %xmm5 + movlhps %xmm1, %xmm1 + rsqrtps %xmm1, %xmm4 + addpd %xmm3, %xmm5 + cvtps2pd %xmm4, %xmm2 + andps dTopMask12+__svml_dasinh_data_internal(%rip), %xmm2 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-12 + */ + movaps %xmm12, %xmm1 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + mulpd %xmm2, %xmm6 + mulpd %xmm2, %xmm5 + movaps %xmm2, %xmm0 + +/* + * Obtain sqrt(1 + X^2) - 1 in two pieces + * sqrt(1 + X^2) - 1 + * = sqrt(Y + W) - 1 + * = (S + T) * (1 + Corr) - 1 + * = [S - 1] + [T + (S + T) * Corr] + * We need a compensated summation for the last part. We treat S - 1 + * as the larger part; it certainly is until about X < 2^-4, and in that + * case, the error is affordable since X dominates over sqrt(1 + X^2) - 1 + * Final sum is dTmp5 (hi) + dTmp7 (lo) + */ + movaps %xmm6, %xmm3 + mulpd %xmm6, %xmm0 + mulpd %xmm5, %xmm2 + subpd %xmm0, %xmm1 + addpd %xmm5, %xmm3 + subpd %xmm12, %xmm6 + subpd %xmm2, %xmm1 + movups SgnMask+__svml_dasinh_data_internal(%rip), %xmm9 + movaps %xmm12, %xmm4 + +/* + * Get the absolute value of the input, since we will exploit antisymmetry + * and mostly assume X >= 0 in the core computation + */ + movaps %xmm9, %xmm10 + andps %xmm13, %xmm10 + +/* + * Check whether the input is finite, by checking |X| <= MaxFloat + * Otherwise set the rangemask so that the callout will get used. + * Note that this will also use the callout for NaNs since not(NaN <= MaxFloat) + */ + movaps %xmm10, %xmm14 + +/* + * The following computation can go wrong for very large X, basically + * because X^2 overflows. 
But for large X we have + * asinh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + movaps %xmm10, %xmm11 + cmpnlepd dLargestFinite+__svml_dasinh_data_internal(%rip), %xmm14 + cmpltpd dBigThreshold+__svml_dasinh_data_internal(%rip), %xmm11 + movmskpd %xmm14, %edx + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + + * 63/256 * e^5 + 231/1024 * e^6 + .... + * So compute the first five nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. + * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + * C4 = 35/128 + * C5 = 63/256 + */ + movups dC5+__svml_dasinh_data_internal(%rip), %xmm14 + movups dHalf+__svml_dasinh_data_internal(%rip), %xmm15 + mulpd %xmm1, %xmm14 + +/* dX2over2 = X^2/2 */ + mulpd %xmm15, %xmm8 + addpd dC4+__svml_dasinh_data_internal(%rip), %xmm14 + mulpd %xmm1, %xmm14 + addpd dC3+__svml_dasinh_data_internal(%rip), %xmm14 + mulpd %xmm1, %xmm14 + addpd dC2+__svml_dasinh_data_internal(%rip), %xmm14 + mulpd %xmm1, %xmm14 + addpd %xmm15, %xmm14 + mulpd %xmm14, %xmm1 + mulpd %xmm3, %xmm1 + addpd %xmm1, %xmm5 + addpd %xmm6, %xmm5 + +/* dX4over4 = X^4/4 */ + movaps %xmm8, %xmm6 + +/* dX46 = -X^4/4 + X^6/8 */ + movaps %xmm8, %xmm7 + mulpd %xmm8, %xmm6 + mulpd %xmm6, %xmm7 + subpd %xmm6, %xmm7 + +/* dX46over2 = -X^4/8 + x^6/16 */ + mulpd %xmm7, %xmm15 + +/* Now multiplex the two possible computations */ + movaps %xmm10, %xmm3 + cmplepd dLittleThreshold+__svml_dasinh_data_internal(%rip), %xmm3 + addpd %xmm15, %xmm8 + movaps %xmm3, %xmm1 + andps %xmm3, %xmm8 + andnps %xmm5, %xmm1 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl.
+ * compute 1+x as high, low parts + */ + movaps %xmm12, %xmm5 + orps %xmm8, %xmm1 + movaps %xmm11, %xmm3 + +/* + * Now do another compensated sum to add |X| + [sqrt(1 + X^2) - 1]. + * It's always safe to assume |X| is larger. + * This is the final 2-part argument to the log1p function + */ + addpd %xmm10, %xmm1 + maxpd %xmm1, %xmm5 + minpd %xmm1, %xmm4 + +/* Now multiplex to the case X = 2^-30 * |input|, Xl = dL = 0 in the "big" case. */ + movups XScale+__svml_dasinh_data_internal(%rip), %xmm8 + andps %xmm9, %xmm1 + mulpd %xmm8, %xmm10 + cmpltpd XThreshold+__svml_dasinh_data_internal(%rip), %xmm1 + movaps %xmm5, %xmm9 + andnps %xmm10, %xmm3 + addpd %xmm4, %xmm9 + orps XhMask+__svml_dasinh_data_internal(%rip), %xmm1 + andps %xmm1, %xmm9 + subpd %xmm9, %xmm5 + andps %xmm11, %xmm9 + +/* Now resume the main code. */ + movups ExpMask+__svml_dasinh_data_internal(%rip), %xmm10 + orps %xmm9, %xmm3 + +/* preserve mantissa, set input exponent to 2^(-10) */ + andps %xmm3, %xmm10 + +/* exponent bits */ + movaps %xmm3, %xmm7 + orps Two10+__svml_dasinh_data_internal(%rip), %xmm10 + psrlq $20, %xmm7 + +/* reciprocal approximation good to at least 11 bits */ + cvtpd2ps %xmm10, %xmm1 + addpd %xmm5, %xmm4 + movlhps %xmm1, %xmm1 + andps %xmm11, %xmm4 + rcpps %xmm1, %xmm0 + cvtps2pd %xmm0, %xmm0 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + movups .FLT_30(%rip), %xmm6 + movaps %xmm11, %xmm1 + addpd %xmm6, %xmm0 + subpd %xmm6, %xmm0 + +/* exponent of X needed to scale Xl */ + movdqu ExpMask0+__svml_dasinh_data_internal(%rip), %xmm5 + +/* 2^ (-10-exp(X) ) */ + movdqu ExpMask2+__svml_dasinh_data_internal(%rip), %xmm2 + pand %xmm3, %xmm5 + psubq %xmm5, %xmm2 + +/* scale DblRcp */ + mulpd %xmm0, %xmm2 + +/* argument reduction */ + mulpd %xmm2, %xmm3 + mulpd %xmm2, %xmm4 + subpd %xmm12, %xmm3 + addpd %xmm4, %xmm3 + +/* polynomial */ + movups poly_coeff+__svml_dasinh_data_internal(%rip), %xmm12 + movaps %xmm3, %xmm2 + pshufd $221, %xmm7, %xmm8 + mulpd %xmm3, 
%xmm12 + +/* biased exponent in DP format */ + cvtdq2pd %xmm8, %xmm14 + addpd poly_coeff+16+__svml_dasinh_data_internal(%rip), %xmm12 + mulpd %xmm3, %xmm2 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + movups dThirtyOne+__svml_dasinh_data_internal(%rip), %xmm9 + +/* exponent*log(2.0) */ + movups Threshold+__svml_dasinh_data_internal(%rip), %xmm5 + addpd %xmm14, %xmm9 + cmpltpd %xmm0, %xmm5 + mulpd %xmm2, %xmm12 + andps %xmm11, %xmm14 + +/* + * prepare table index + * table lookup + */ + movaps %xmm0, %xmm11 + movups poly_coeff+32+__svml_dasinh_data_internal(%rip), %xmm0 + andnps %xmm9, %xmm1 + mulpd %xmm3, %xmm0 + addpd poly_coeff+48+__svml_dasinh_data_internal(%rip), %xmm0 + addpd %xmm12, %xmm0 + +/* reconstruction */ + mulpd %xmm0, %xmm2 + andps Bias+__svml_dasinh_data_internal(%rip), %xmm5 + psrlq $40, %xmm11 + orps Bias1+__svml_dasinh_data_internal(%rip), %xmm5 + orps %xmm14, %xmm1 + movd %xmm11, %eax + pshufd $2, %xmm11, %xmm11 + +/* Finally, reincorporate the original sign. 
*/ + movups dSign+__svml_dasinh_data_internal(%rip), %xmm0 + subpd %xmm5, %xmm1 + addpd %xmm2, %xmm3 + movd %xmm11, %ecx + mulpd L2+__svml_dasinh_data_internal(%rip), %xmm1 + movslq %eax, %rax + andps %xmm13, %xmm0 + movslq %ecx, %rcx + movsd (%rsi,%rax), %xmm6 + movhpd (%rsi,%rcx), %xmm6 + addpd %xmm3, %xmm6 + addpd %xmm6, %xmm1 + pxor %xmm1, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx xmm0 xmm13 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm13, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $2, %r12d + +/* Check bits in range 
mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -48; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xd0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -56; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -64; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call asinh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 48(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVbN2v_asinh_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_dasinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(16)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(16)) VUINT32 poly_coeff[4][2][2]; + __declspec(align(16)) VUINT32 ExpMask[2][2]; + __declspec(align(16)) VUINT32 Two10[2][2]; + __declspec(align(16)) VUINT32 MinLog1p[2][2]; + __declspec(align(16)) VUINT32 MaxLog1p[2][2]; + __declspec(align(16)) VUINT32 One[2][2]; + __declspec(align(16)) VUINT32 SgnMask[2][2]; + __declspec(align(16)) VUINT32 XThreshold[2][2]; + __declspec(align(16))
VUINT32 XhMask[2][2]; + __declspec(align(16)) VUINT32 Threshold[2][2]; + __declspec(align(16)) VUINT32 Bias[2][2]; + __declspec(align(16)) VUINT32 Bias1[2][2]; + __declspec(align(16)) VUINT32 ExpMask0[2][2]; + __declspec(align(16)) VUINT32 ExpMask2[2][2]; + __declspec(align(16)) VUINT32 L2[2][2]; + __declspec(align(16)) VUINT32 dBigThreshold[2][2]; + __declspec(align(16)) VUINT32 dC2[2][2]; + __declspec(align(16)) VUINT32 dC3[2][2]; + __declspec(align(16)) VUINT32 dC4[2][2]; + __declspec(align(16)) VUINT32 dC5[2][2]; + __declspec(align(16)) VUINT32 dHalf[2][2]; + __declspec(align(16)) VUINT32 dLargestFinite[2][2]; + __declspec(align(16)) VUINT32 dLittleThreshold[2][2]; + __declspec(align(16)) VUINT32 dSign[2][2]; + __declspec(align(16)) VUINT32 dThirtyOne[2][2]; + __declspec(align(16)) VUINT32 dTopMask12[2][2]; + __declspec(align(16)) VUINT32 dTopMask26[2][2]; + __declspec(align(16)) VUINT32 dTopMask29[2][2]; + __declspec(align(16)) VUINT32 XScale[2][2]; +} __svml_dasinh_data_internal; +#endif +__svml_dasinh_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce 
+ .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 
0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + .quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 
0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + 
.quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + .quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 
0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + 
.quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + .quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 
0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + 
.quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + .quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 
0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + 
.quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + .quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 
0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + 
.quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 16 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 
0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 
0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 
0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 
0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 
0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 
0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 
0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 0x3f8010157588de71 + .quad 0x3f7c189cbb0e27fb + .quad 0x3f78121214586b54 + .quad 0x3f740c8a747878e2 + .quad 0x3f70080559588b35 + .quad 0x3f680904828985c0 + .quad 0x3f60040155d5889e + .quad 0x3f50020055655889 + .quad 0x0000000000000000 + /*== poly_coeff[4] ==*/ + .align 16 + .quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */ + .quad 0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */ + .quad 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */ + .quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */ + /*== ExpMask ==*/ + .align 16 + .quad 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 16 + .quad 0x3f50000000000000, 0x3f50000000000000 + /*== MinLog1p = -1+2^(-53) ==*/ + .align 16 + .quad 0xbfefffffffffffff, 0xbfefffffffffffff + /*== MaxLog1p ==*/ + .align 16 + .quad 0x7f3ffffffffff000, 0x7f3ffffffffff000 + /*== One ==*/ + .align 16 + .quad 0x3ff0000000000000, 0x3ff0000000000000 + /*== SgnMask ==*/ + .align 16 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff + /*== XThreshold ==*/ + .align 16 + .quad 0x3e00000000000000, 0x3e00000000000000 + /*== XhMask ==*/ + .align 16 
+ .quad 0xfffffffffffffc00, 0xfffffffffffffc00 + /*== Threshold ==*/ + .align 16 + .quad 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 16 + .quad 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 16 + .quad 0x408ff00000000000, 0x408ff00000000000 + /*== ExpMask ==*/ + .align 16 + .quad 0x7ff0000000000000, 0x7ff0000000000000 + /*== ExpMask2 ==*/ + .align 16 + .quad 0x7f40000000000000, 0x7f40000000000000 + /*== L2L ==*/ + .align 16 + .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF + /*== dBigThreshold ==*/ + .align 16 + .quad 0x41D0000000000000, 0x41D0000000000000 + /*== dC2 ==*/ + .align 16 + .quad 0x3FD8000000000000, 0x3FD8000000000000 + /*== dC3 ==*/ + .align 16 + .quad 0x3FD4000000000000, 0x3FD4000000000000 + /*== dC4 ==*/ + .align 16 + .quad 0x3FD1800000000000, 0x3FD1800000000000 + /*== dC5 ==*/ + .align 16 + .quad 0x3FCF800000000000, 0x3FCF800000000000 + /*== dHalf ==*/ + .align 16 + .quad 0x3FE0000000000000, 0x3FE0000000000000 + /*== dLargestFinite ==*/ + .align 16 + .quad 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF + /*== dLittleThreshold ==*/ + .align 16 + .quad 0x3F60000000000000, 0x3F60000000000000 + /*== dSign ==*/ + .align 16 + .quad 0x8000000000000000, 0x8000000000000000 + /*== dThirtyOne ==*/ + .align 16 + .quad 0x403F000000000000, 0x403F000000000000 + /*== dTopMask12 ==*/ + .align 16 + .quad 0xFFFFFE0000000000, 0xFFFFFE0000000000 + /*== dTopMask26 ==*/ + .align 16 + .quad 0xFFFFFFFFF8000000, 0xFFFFFFFFF8000000 + /*== dTopMask29 ==*/ + .align 16 + .quad 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000 + /*== XScale ==*/ + .align 16 + .quad 0x3E10000000000000, 0x3E10000000000000 + .align 16 + .type __svml_dasinh_data_internal,@object + .size __svml_dasinh_data_internal,.-__svml_dasinh_data_internal + .align 16 + +.FLT_30: + .long 0x00000000,0x43380000,0x00000000,0x43380000 + .type .FLT_30,@object + .size .FLT_30,16 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core-sse.S 
new file mode 100644 index 0000000000..903b5f0fb5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized asinh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN4v_asinh _ZGVdN4v_asinh_sse_wrapper +#include "../svml_d_asinh4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core.c new file mode 100644 index 0000000000..e7acd032b5 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized asinh, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVdN4v_asinh +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN4v_asinh, __GI__ZGVdN4v_asinh, __redirect__ZGVdN4v_asinh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core_avx2.S new file mode 100644 index 0000000000..bf7a0f339b --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh4_core_avx2.S @@ -0,0 +1,1598 @@ +/* Function asinh vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute asinh(x) as log(x + sqrt(x*x + 1)) + * + * Special cases: + * + * asinh(NaN) = quiet NaN, and raise invalid exception + * asinh(INF) = that INF + * asinh(0) = that 0 + * + */ + +/* Offsets for data table __svml_dasinh_data_internal + */ +#define Log_HA_table 0 +#define Log_LA_table 8224 +#define poly_coeff 12352 +#define ExpMask 12480 +#define Two10 12512 +#define MinLog1p 12544 +#define MaxLog1p 12576 +#define One 12608 +#define SgnMask 12640 +#define XThreshold 12672 +#define XhMask 12704 +#define Threshold 12736 +#define Bias 12768 +#define Bias1 12800 +#define ExpMask0 12832 +#define ExpMask2 12864 +#define L2 12896 +#define dBigThreshold 12928 +#define dC2 12960 +#define dC3 12992 +#define dC4 13024 +#define dC5 13056 +#define dHalf 13088 +#define dLargestFinite 13120 +#define dLittleThreshold 13152 +#define dSign 13184 +#define dThirtyOne 13216 +#define dTopMask12 13248 +#define dTopMask29 13280 +#define XScale 13312 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN4v_asinh_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + lea -4218848+__svml_dasinh_data_internal(%rip), %r8 + vmovapd %ymm0, %ymm13 + vmovupd SgnMask+__svml_dasinh_data_internal(%rip), %ymm9 + +/* Load the constant 1 and a sign mask */ + vmovupd One+__svml_dasinh_data_internal(%rip), %ymm12 + +/* No need to split X when FMA is available in hardware. */ + vmulpd %ymm13, %ymm13, %ymm8 + +/* + * Get the absolute value of the input, since we will exploit antisymmetry + * and mostly assume X >= 0 in the core computation + */ + vandpd %ymm9, %ymm13, %ymm10 + +/* + * Check whether the input is finite, by checking |X| <= MaxFloat + * Otherwise set the rangemask so that the callout will get used. 
+ * Note that this will also use the callout for NaNs since not(NaN <= MaxFloat) + */ + vcmpnle_uqpd dLargestFinite+__svml_dasinh_data_internal(%rip), %ymm10, %ymm14 + +/* + * Finally, express Y + W = X^2 + 1 accurately where Y has <= 29 bits. + * If |X| <= 1 then |XHi| <= 1 and so |X2Hi| <= 1, so we can treat 1 + * as the dominant component in the compensated summation. Otherwise, + * if |X| >= 1, then since X2Hi only has 52 significant bits, the basic + * addition will be exact anyway until we get to |X| >= 2^53. But by + * that time the log function is well-conditioned enough that the + * rounding error doesn't matter. Hence we can treat 1 as dominant even + * if it literally isn't. + */ + vaddpd %ymm8, %ymm12, %ymm5 + +/* + * The following computation can go wrong for very large X, basically + * because X^2 overflows. But for large X we have + * asinh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + vcmplt_oqpd dBigThreshold+__svml_dasinh_data_internal(%rip), %ymm10, %ymm11 + vsubpd %ymm5, %ymm12, %ymm15 + vmovmskpd %ymm14, %eax + vandpd dTopMask29+__svml_dasinh_data_internal(%rip), %ymm5, %ymm14 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 12 significant bits in case it isn't already + * This means that R * Y and R^2 * Y are exactly representable. + */ + vcvtpd2ps %ymm14, %xmm1 + vaddpd %ymm15, %ymm8, %ymm0 + vsubpd %ymm14, %ymm5, %ymm2 + vrsqrtps %xmm1, %xmm3 + vmovapd %ymm13, %ymm7 + vfmsub213pd %ymm8, %ymm13, %ymm7 + vcvtps2pd %xmm3, %ymm6 + vaddpd %ymm0, %ymm7, %ymm4 + +/* + * Unfortunately, we can still be in trouble if |X| <= 2^-10, since + * the absolute error 2^-(12+53)-ish in sqrt(1 + X^2) gets scaled up + * by 1/X and comes close to our threshold.
Hence if |X| <= 2^-9, + * perform an alternative computation + * sqrt(1 + X^2) - 1 = X^2/2 - X^4/8 + X^6/16 + * X2 = X^2 + */ + vaddpd %ymm7, %ymm8, %ymm7 + vaddpd %ymm2, %ymm4, %ymm15 + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + + * 63/256 * e^5 + 231/1024 * e^6 + .... + * So compute the first five nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. + * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + * C4 = 35/128 + * C5 = 63/256 + */ + vmovupd dC5+__svml_dasinh_data_internal(%rip), %ymm4 + vandpd dTopMask12+__svml_dasinh_data_internal(%rip), %ymm6, %ymm0 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + vmulpd %ymm0, %ymm14, %ymm3 + vmulpd %ymm15, %ymm0, %ymm1 + vmovupd dHalf+__svml_dasinh_data_internal(%rip), %ymm6 + vsubpd %ymm12, %ymm3, %ymm14 + +/* + * Obtain sqrt(1 + X^2) - 1 in two pieces + * sqrt(1 + X^2) - 1 + * = sqrt(Y + W) - 1 + * = (S + T) * (1 + Corr) - 1 + * = [S - 1] + [T + (S + T) * Corr] + * We need a compensated summation for the last part. 
We treat S - 1 + * as the larger part; it certainly is until about X < 2^-4, and in that + * case, the error is affordable since X dominates over sqrt(1 + X^2) - 1 + * Final sum is dTmp5 (hi) + dTmp7 (lo) + */ + vaddpd %ymm1, %ymm3, %ymm2 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-12 + */ + vmovapd %ymm12, %ymm5 + vfnmadd231pd %ymm3, %ymm0, %ymm5 + vfnmadd231pd %ymm1, %ymm0, %ymm5 + vfmadd213pd dC4+__svml_dasinh_data_internal(%rip), %ymm5, %ymm4 + vfmadd213pd dC3+__svml_dasinh_data_internal(%rip), %ymm5, %ymm4 + vfmadd213pd dC2+__svml_dasinh_data_internal(%rip), %ymm5, %ymm4 + vfmadd213pd %ymm6, %ymm5, %ymm4 + vmulpd %ymm4, %ymm5, %ymm0 + vfmadd213pd %ymm1, %ymm2, %ymm0 + +/* Now multiplex the two possible computations */ + vcmple_oqpd dLittleThreshold+__svml_dasinh_data_internal(%rip), %ymm10, %ymm2 + vaddpd %ymm14, %ymm0, %ymm15 + +/* dX2over2 = X^2/2 */ + vmulpd %ymm7, %ymm6, %ymm0 + +/* dX4over4 = X^4/4 */ + vmulpd %ymm0, %ymm0, %ymm8 + +/* dX46 = -X^4/4 + X^6/8 */ + vfmsub231pd %ymm0, %ymm8, %ymm8 + +/* dX46over2 = -X^4/8 + x^6/16 */ + vmulpd %ymm8, %ymm6, %ymm5 + +/* 2^ (-10-exp(X) ) */ + vmovupd ExpMask2+__svml_dasinh_data_internal(%rip), %ymm8 + vaddpd %ymm5, %ymm0, %ymm4 + vblendvpd %ymm2, %ymm4, %ymm15, %ymm1 + +/* + * Now do another compensated sum to add |X| + [sqrt(1 + X^2) - 1]. + * It's always safe to assume |X| is larger. + * This is the final 2-part argument to the log1p function + */ + vaddpd %ymm1, %ymm10, %ymm3 + +/* Now multiplex to the case X = 2^-30 * |input|, Xl = dL = 0 in the "big" case. */ + vmulpd XScale+__svml_dasinh_data_internal(%rip), %ymm10, %ymm10 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. 
+ * compute 1+x as high, low parts + */ + vmaxpd %ymm3, %ymm12, %ymm6 + vminpd %ymm3, %ymm12, %ymm7 + vandpd %ymm9, %ymm3, %ymm9 + vcmplt_oqpd XThreshold+__svml_dasinh_data_internal(%rip), %ymm9, %ymm0 + vaddpd %ymm7, %ymm6, %ymm5 + vorpd XhMask+__svml_dasinh_data_internal(%rip), %ymm0, %ymm4 + vandpd %ymm4, %ymm5, %ymm1 + vblendvpd %ymm11, %ymm1, %ymm10, %ymm5 + vsubpd %ymm1, %ymm6, %ymm2 + +/* exponent bits */ + vpsrlq $20, %ymm5, %ymm10 + vaddpd %ymm2, %ymm7, %ymm3 + +/* + * Now resume the main code. + * preserve mantissa, set input exponent to 2^(-10) + */ + vandpd ExpMask+__svml_dasinh_data_internal(%rip), %ymm5, %ymm0 + vorpd Two10+__svml_dasinh_data_internal(%rip), %ymm0, %ymm2 + +/* reciprocal approximation good to at least 11 bits */ + vcvtpd2ps %ymm2, %xmm6 + vrcpps %xmm6, %xmm7 + vcvtps2pd %xmm7, %ymm15 + +/* exponent of X needed to scale Xl */ + vandps ExpMask0+__svml_dasinh_data_internal(%rip), %ymm5, %ymm9 + vpsubq %ymm9, %ymm8, %ymm0 + vandpd %ymm11, %ymm3, %ymm4 + +/* round reciprocal to nearest integer, will have 1+9 mantissa bits */ + vroundpd $0, %ymm15, %ymm3 + +/* scale DblRcp */ + vmulpd %ymm0, %ymm3, %ymm2 + +/* argument reduction */ + vfmsub213pd %ymm12, %ymm2, %ymm5 + vmulpd %ymm2, %ymm4, %ymm12 + vmovupd poly_coeff+64+__svml_dasinh_data_internal(%rip), %ymm2 + vaddpd %ymm12, %ymm5, %ymm5 + vfmadd213pd poly_coeff+96+__svml_dasinh_data_internal(%rip), %ymm5, %ymm2 + vmulpd %ymm5, %ymm5, %ymm4 + vextractf128 $1, %ymm10, %xmm14 + vshufps $221, %xmm14, %xmm10, %xmm1 + +/* biased exponent in DP format */ + vcvtdq2pd %xmm1, %ymm7 + +/* exponent*log(2.0) */ + vmovupd Threshold+__svml_dasinh_data_internal(%rip), %ymm10 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + vaddpd dThirtyOne+__svml_dasinh_data_internal(%rip), %ymm7, %ymm6 + vblendvpd %ymm11, %ymm7, %ymm6, %ymm1 + +/* + * prepare table index + * table lookup + */ + vpsrlq $40, %ymm3, %ymm11 + vcmplt_oqpd %ymm3, %ymm10, %ymm3 + vandpd 
Bias+__svml_dasinh_data_internal(%rip), %ymm3, %ymm14 + vorpd Bias1+__svml_dasinh_data_internal(%rip), %ymm14, %ymm15 + vsubpd %ymm15, %ymm1, %ymm1 + vmulpd L2+__svml_dasinh_data_internal(%rip), %ymm1, %ymm3 + +/* polynomial */ + vmovupd poly_coeff+__svml_dasinh_data_internal(%rip), %ymm1 + vfmadd213pd poly_coeff+32+__svml_dasinh_data_internal(%rip), %ymm5, %ymm1 + vfmadd213pd %ymm2, %ymm4, %ymm1 + +/* reconstruction */ + vfmadd213pd %ymm5, %ymm4, %ymm1 + vextractf128 $1, %ymm11, %xmm7 + vmovd %xmm11, %edx + vmovd %xmm7, %esi + movslq %edx, %rdx + vpextrd $2, %xmm11, %ecx + movslq %esi, %rsi + vpextrd $2, %xmm7, %edi + movslq %ecx, %rcx + movslq %edi, %rdi + vmovsd (%r8,%rdx), %xmm0 + vmovsd (%r8,%rsi), %xmm8 + vmovhpd (%r8,%rcx), %xmm0, %xmm6 + vmovhpd (%r8,%rdi), %xmm8, %xmm9 + vinsertf128 $1, %xmm9, %ymm6, %ymm0 + vaddpd %ymm1, %ymm0, %ymm0 + vaddpd %ymm0, %ymm3, %ymm7 + +/* Finally, reincorporate the original sign. */ + vandpd dSign+__svml_dasinh_data_internal(%rip), %ymm13, %ymm6 + vxorpd %ymm7, %ymm6, %ymm0 + testl %eax, %eax + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 eax ymm0 ymm13 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovupd %ymm13, 32(%rsp) + vmovupd %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 eax ymm0 + + xorl %edx, %edx + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: 
-88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovupd 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 32(%rsp,%r14,8), %xmm0 + call asinh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 64(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp
L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN4v_asinh_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_dasinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 Log_HA_table[(1<<10)+2][2]; + __declspec(align(32)) VUINT32 Log_LA_table[(1<<9)+1][2]; + __declspec(align(32)) VUINT32 poly_coeff[4][4][2]; + __declspec(align(32)) VUINT32 ExpMask[4][2]; + __declspec(align(32)) VUINT32 Two10[4][2]; + __declspec(align(32)) VUINT32 MinLog1p[4][2]; + __declspec(align(32)) VUINT32 MaxLog1p[4][2]; + __declspec(align(32)) VUINT32 One[4][2]; + __declspec(align(32)) VUINT32 SgnMask[4][2]; + __declspec(align(32)) VUINT32 XThreshold[4][2]; + __declspec(align(32)) VUINT32 XhMask[4][2]; + __declspec(align(32)) VUINT32 Threshold[4][2]; + __declspec(align(32)) VUINT32 Bias[4][2]; + __declspec(align(32)) VUINT32 Bias1[4][2]; + __declspec(align(32)) VUINT32 ExpMask0[4][2]; + __declspec(align(32)) VUINT32 ExpMask2[4][2]; + __declspec(align(32)) VUINT32 L2[4][2]; + __declspec(align(32)) VUINT32 dBigThreshold[4][2]; + __declspec(align(32)) VUINT32 dC2[4][2]; + __declspec(align(32)) VUINT32 dC3[4][2]; + __declspec(align(32)) VUINT32 dC4[4][2]; + __declspec(align(32)) VUINT32 dC5[4][2]; + __declspec(align(32)) VUINT32 dHalf[4][2]; + __declspec(align(32)) VUINT32 dLargestFinite[4][2]; + __declspec(align(32)) VUINT32 dLittleThreshold[4][2]; + __declspec(align(32)) VUINT32 dSign[4][2]; + __declspec(align(32)) VUINT32 dThirtyOne[4][2]; + __declspec(align(32)) VUINT32 dTopMask12[4][2]; + __declspec(align(32)) VUINT32 dTopMask29[4][2]; + __declspec(align(32)) VUINT32 XScale[4][2]; +} __svml_dasinh_data_internal; +#endif +__svml_dasinh_data_internal: + /* Log_HA_table */ + .quad 0xc086232bdd7a8300, 0xbe1ce91eef3fb100 + .quad 0xc086232fdc7ad828, 0xbe1cefcffda73b6a + .quad 0xc0862333d97d2ba0, 0xbe1cef406748f1ff + .quad 0xc0862337d48378e0, 0xbe1cef2a9429925a + .quad 0xc086233bcd8fb878, 0xbe1cf138d17ebecb + .quad 
0xc086233fc4a3e018, 0xbe1ceff2dbbbb29e + .quad 0xc0862343b9c1e270, 0xbe1cf1a42aae437b + .quad 0xc0862347acebaf68, 0xbe1cef3b152048af + .quad 0xc086234b9e2333f0, 0xbe1cef20e127805e + .quad 0xc086234f8d6a5a30, 0xbe1cf00ad6052cf4 + .quad 0xc08623537ac30980, 0xbe1cefc4642ee597 + .quad 0xc0862357662f2660, 0xbe1cf1f277d36e16 + .quad 0xc086235b4fb092a0, 0xbe1ceed009e8d8e6 + .quad 0xc086235f37492d28, 0xbe1cf1e4038cb362 + .quad 0xc08623631cfad250, 0xbe1cf0b0873b8557 + .quad 0xc086236700c75b98, 0xbe1cf15bb3227c0b + .quad 0xc086236ae2b09fe0, 0xbe1cf151ef8ca9ed + .quad 0xc086236ec2b87358, 0xbe1cefe1dc2cd2ed + .quad 0xc0862372a0e0a780, 0xbe1cf0d1eec5454f + .quad 0xc08623767d2b0b48, 0xbe1ceeefd570bbce + .quad 0xc086237a57996af0, 0xbe1cee99ae91b3a7 + .quad 0xc086237e302d9028, 0xbe1cf0412830fbd1 + .quad 0xc086238206e94218, 0xbe1ceee898588610 + .quad 0xc0862385dbce4548, 0xbe1cee9a1fbcaaea + .quad 0xc0862389aede5bc0, 0xbe1ceed8e7cc1ad6 + .quad 0xc086238d801b4500, 0xbe1cf10c8d059da6 + .quad 0xc08623914f86be18, 0xbe1ceee6c63a8165 + .quad 0xc08623951d228180, 0xbe1cf0c3592d2ff1 + .quad 0xc0862398e8f04758, 0xbe1cf0026cc4cb1b + .quad 0xc086239cb2f1c538, 0xbe1cf15d48d8e670 + .quad 0xc08623a07b28ae60, 0xbe1cef359363787c + .quad 0xc08623a44196b390, 0xbe1cefdf1ab2e82c + .quad 0xc08623a8063d8338, 0xbe1cefe43c02aa84 + .quad 0xc08623abc91ec960, 0xbe1cf044f5ae35b7 + .quad 0xc08623af8a3c2fb8, 0xbe1cf0b0b4001e1b + .quad 0xc08623b349975d98, 0xbe1cf1bae76dfbcf + .quad 0xc08623b70731f810, 0xbe1cef0a72e13a62 + .quad 0xc08623bac30da1c8, 0xbe1cf184007d2b6b + .quad 0xc08623be7d2bfb40, 0xbe1cf16f4b239e98 + .quad 0xc08623c2358ea2a0, 0xbe1cf0976acada87 + .quad 0xc08623c5ec3733d0, 0xbe1cf066318a16ff + .quad 0xc08623c9a1274880, 0xbe1ceffaa7148798 + .quad 0xc08623cd54607820, 0xbe1cf23ab02e9b6e + .quad 0xc08623d105e45800, 0xbe1cefdfef7d4fde + .quad 0xc08623d4b5b47b20, 0xbe1cf17fece44f2b + .quad 0xc08623d863d27270, 0xbe1cf18f907d0d7c + .quad 0xc08623dc103fccb0, 0xbe1cee61fe072c98 + .quad 0xc08623dfbafe1668, 
0xbe1cf022dd891e2f + .quad 0xc08623e3640eda20, 0xbe1ceecc1daf4358 + .quad 0xc08623e70b73a028, 0xbe1cf0173c4fa380 + .quad 0xc08623eab12deec8, 0xbe1cf16a2150c2f4 + .quad 0xc08623ee553f4a30, 0xbe1cf1bf980b1f4b + .quad 0xc08623f1f7a93480, 0xbe1cef8b731663c2 + .quad 0xc08623f5986d2dc0, 0xbe1cee9a664d7ef4 + .quad 0xc08623f9378cb3f0, 0xbe1cf1eda2af6400 + .quad 0xc08623fcd5094320, 0xbe1cf1923f9d68d7 + .quad 0xc086240070e45548, 0xbe1cf0747cd3e03a + .quad 0xc08624040b1f6260, 0xbe1cf22ee855bd6d + .quad 0xc0862407a3bbe078, 0xbe1cf0d57360c00b + .quad 0xc086240b3abb4398, 0xbe1ceebc815cd575 + .quad 0xc086240ed01efdd0, 0xbe1cf03bfb970951 + .quad 0xc086241263e87f50, 0xbe1cf16e74768529 + .quad 0xc0862415f6193658, 0xbe1cefec64b8becb + .quad 0xc086241986b28f30, 0xbe1cf0838d210baa + .quad 0xc086241d15b5f448, 0xbe1cf0ea86e75b11 + .quad 0xc0862420a324ce28, 0xbe1cf1708d11d805 + .quad 0xc08624242f008380, 0xbe1ceea988c5a417 + .quad 0xc0862427b94a7910, 0xbe1cef166a7bbca5 + .quad 0xc086242b420411d0, 0xbe1cf0c9d9e86a38 + .quad 0xc086242ec92eaee8, 0xbe1cef0946455411 + .quad 0xc08624324ecbaf98, 0xbe1cefea60907739 + .quad 0xc0862435d2dc7160, 0xbe1cf1ed0934ce42 + .quad 0xc086243955624ff8, 0xbe1cf191ba746c7d + .quad 0xc086243cd65ea548, 0xbe1ceeec78cf2a7e + .quad 0xc086244055d2c968, 0xbe1cef345284c119 + .quad 0xc0862443d3c012b8, 0xbe1cf24f77355219 + .quad 0xc08624475027d5e8, 0xbe1cf05bf087e114 + .quad 0xc086244acb0b65d0, 0xbe1cef3504a32189 + .quad 0xc086244e446c1398, 0xbe1ceff54b2a406f + .quad 0xc0862451bc4b2eb8, 0xbe1cf0757d54ed4f + .quad 0xc086245532aa04f0, 0xbe1cf0c8099fdfd5 + .quad 0xc0862458a789e250, 0xbe1cf0b173796a31 + .quad 0xc086245c1aec1138, 0xbe1cf11d8734540d + .quad 0xc086245f8cd1da60, 0xbe1cf1916a723ceb + .quad 0xc0862462fd3c84d8, 0xbe1cf19a911e1da7 + .quad 0xc08624666c2d5608, 0xbe1cf23a9ef72e4f + .quad 0xc0862469d9a591c0, 0xbe1cef503d947663 + .quad 0xc086246d45a67a18, 0xbe1cf0fceeb1a0b2 + .quad 0xc0862470b0314fa8, 0xbe1cf107e27e4fbc + .quad 0xc086247419475160, 0xbe1cf03dd9922331 + 
.quad 0xc086247780e9bc98, 0xbe1cefce1a10e129 + .quad 0xc086247ae719cd18, 0xbe1ceea47f73c4f6 + .quad 0xc086247e4bd8bd10, 0xbe1ceec0ac56d100 + .quad 0xc0862481af27c528, 0xbe1cee8a6593278a + .quad 0xc086248511081c70, 0xbe1cf2231dd9dec7 + .quad 0xc0862488717af888, 0xbe1cf0b4b8ed7da8 + .quad 0xc086248bd0818d68, 0xbe1cf1bd8d835002 + .quad 0xc086248f2e1d0d98, 0xbe1cf259acc107f4 + .quad 0xc08624928a4eaa20, 0xbe1cee897636b00c + .quad 0xc0862495e5179270, 0xbe1cee757f20c326 + .quad 0xc08624993e78f490, 0xbe1cefafd3aa54a4 + .quad 0xc086249c9673fd10, 0xbe1cee7298d38b97 + .quad 0xc086249fed09d6f8, 0xbe1ceedc158d4ceb + .quad 0xc08624a3423babe0, 0xbe1cf2282987cb2e + .quad 0xc08624a6960aa400, 0xbe1cefe7381ecc4b + .quad 0xc08624a9e877e600, 0xbe1cef328dbbce80 + .quad 0xc08624ad39849728, 0xbe1cefde45f3cc71 + .quad 0xc08624b08931db58, 0xbe1cefa8b89433b9 + .quad 0xc08624b3d780d500, 0xbe1cef6773c0b139 + .quad 0xc08624b72472a528, 0xbe1cf031c931c11f + .quad 0xc08624ba70086b78, 0xbe1cf088f49275e7 + .quad 0xc08624bdba434630, 0xbe1cf17de0eaa86d + .quad 0xc08624c103245238, 0xbe1cefd492f1ba75 + .quad 0xc08624c44aacab08, 0xbe1cf1253e154466 + .quad 0xc08624c790dd6ad0, 0xbe1cf0fb09ee6d55 + .quad 0xc08624cad5b7aa58, 0xbe1cf1f08dd048fe + .quad 0xc08624ce193c8120, 0xbe1ceeca0809697f + .quad 0xc08624d15b6d0538, 0xbe1cef8d5662d968 + .quad 0xc08624d49c4a4b78, 0xbe1cee97b556ed78 + .quad 0xc08624d7dbd56750, 0xbe1cf1b14b6acb75 + .quad 0xc08624db1a0f6b00, 0xbe1cef1e860623f2 + .quad 0xc08624de56f96758, 0xbe1ceeaf4d156f3d + .quad 0xc08624e192946bf0, 0xbe1ceecc12b400ed + .quad 0xc08624e4cce18710, 0xbe1cf180c40c794f + .quad 0xc08624e805e1c5c8, 0xbe1cf185a08f7f65 + .quad 0xc08624eb3d9633d8, 0xbe1cef45fc924078 + .quad 0xc08624ee73ffdbb0, 0xbe1cf1e4f457f32a + .quad 0xc08624f1a91fc6a0, 0xbe1cf040147b8a5a + .quad 0xc08624f4dcf6fc98, 0xbe1cf1effca0dfb2 + .quad 0xc08624f80f868468, 0xbe1cf0470146e5bc + .quad 0xc08624fb40cf6390, 0xbe1cef4dd186e501 + .quad 0xc08624fe70d29e60, 0xbe1ceebe257f66c7 + .quad 0xc08625019f9137f0, 
0xbe1ceefb7a1c395c + .quad 0xc0862504cd0c3220, 0xbe1cf209dedfed8c + .quad 0xc0862507f9448db0, 0xbe1cf082da464994 + .quad 0xc086250b243b4a18, 0xbe1cee88694a73cf + .quad 0xc086250e4df165a0, 0xbe1cf0b61e8f0531 + .quad 0xc08625117667dd78, 0xbe1cf1106599c962 + .quad 0xc08625149d9fad98, 0xbe1ceff1ee88af1f + .quad 0xc0862517c399d0c8, 0xbe1cf0f746994ef6 + .quad 0xc086251ae85740b8, 0xbe1cefe8a1d077e4 + .quad 0xc086251e0bd8f5e0, 0xbe1cf1a1da036092 + .quad 0xc08625212e1fe7a8, 0xbe1cf0f8a7786fcd + .quad 0xc08625244f2d0c48, 0xbe1cefa1174a07a7 + .quad 0xc08625276f0158d8, 0xbe1cef1043aa5b25 + .quad 0xc086252a8d9dc150, 0xbe1cf15d521c169d + .quad 0xc086252dab033898, 0xbe1cf220bba8861f + .quad 0xc0862530c732b078, 0xbe1cef51e310eae2 + .quad 0xc0862533e22d1988, 0xbe1cf222fcedd8ae + .quad 0xc0862536fbf36370, 0xbe1cefdb4da4bda8 + .quad 0xc086253a14867ca0, 0xbe1ceeafc1112171 + .quad 0xc086253d2be75280, 0xbe1cee99dfb4b408 + .quad 0xc08625404216d160, 0xbe1cf22d2536f06b + .quad 0xc08625435715e498, 0xbe1cef6abbf2e268 + .quad 0xc08625466ae57648, 0xbe1cf093a14789f5 + .quad 0xc08625497d866fa0, 0xbe1cf0f93655603c + .quad 0xc086254c8ef9b8b8, 0xbe1cf1cc40c9aafc + .quad 0xc086254f9f4038a8, 0xbe1ceeea5f4e9157 + .quad 0xc0862552ae5ad568, 0xbe1cefa9f52d4997 + .quad 0xc0862555bc4a7400, 0xbe1cefa490a638ff + .quad 0xc0862558c90ff868, 0xbe1cef7fcf797d6f + .quad 0xc086255bd4ac4590, 0xbe1cf1b4c51113c9 + .quad 0xc086255edf203d78, 0xbe1cef55e5b4a55d + .quad 0xc0862561e86cc100, 0xbe1cf0d37a25f9dc + .quad 0xc0862564f092b028, 0xbe1ceebe9efc19d9 + .quad 0xc0862567f792e9d8, 0xbe1cee8ad30a57b5 + .quad 0xc086256afd6e4c08, 0xbe1cef4e1817b90b + .quad 0xc086256e0225b3b8, 0xbe1cee7fa9229996 + .quad 0xc086257105b9fce0, 0xbe1cf0b54963d945 + .quad 0xc0862574082c0298, 0xbe1cee5f2f3c7995 + .quad 0xc0862577097c9ee0, 0xbe1cf0828e303a2c + .quad 0xc086257a09acaae0, 0xbe1cf172c3078947 + .quad 0xc086257d08bcfec0, 0xbe1cf189252afa22 + .quad 0xc086258006ae71b8, 0xbe1cefdb80426923 + .quad 0xc08625830381da08, 0xbe1ceef1391a0372 + 
.quad 0xc0862585ff380d00, 0xbe1cf17720c78d13 + .quad 0xc0862588f9d1df18, 0xbe1ceef1f9027d83 + .quad 0xc086258bf35023b8, 0xbe1cf06fac99dec9 + .quad 0xc086258eebb3ad78, 0xbe1cf1373eeb45c0 + .quad 0xc0862591e2fd4e00, 0xbe1cef777536bb81 + .quad 0xc0862594d92dd600, 0xbe1cf0f43ca40766 + .quad 0xc0862597ce461558, 0xbe1cefb2cfc6766b + .quad 0xc086259ac246daf0, 0xbe1ceea49e64ffa2 + .quad 0xc086259db530f4c8, 0xbe1cf250fa457dec + .quad 0xc08625a0a7053018, 0xbe1cf17d8bb2a44e + .quad 0xc08625a397c45918, 0xbe1cf1d5906d54b7 + .quad 0xc08625a6876f3b30, 0xbe1cf08fe7b31780 + .quad 0xc08625a97606a0e0, 0xbe1cef13edfc9d11 + .quad 0xc08625ac638b53c8, 0xbe1cef9d2b107219 + .quad 0xc08625af4ffe1cb0, 0xbe1cf1ddd4ff6160 + .quad 0xc08625b23b5fc390, 0xbe1cefa02a996495 + .quad 0xc08625b525b10f68, 0xbe1cf166a7e37ee5 + .quad 0xc08625b80ef2c680, 0xbe1cef0b171068a5 + .quad 0xc08625baf725ae28, 0xbe1cf05c80779283 + .quad 0xc08625bdde4a8af0, 0xbe1cf1bbfbffb889 + .quad 0xc08625c0c4622090, 0xbe1cf0b8666c0124 + .quad 0xc08625c3a96d31e0, 0xbe1cf0a8fcf47a86 + .quad 0xc08625c68d6c80f0, 0xbe1cef46e18cb092 + .quad 0xc08625c97060cef0, 0xbe1cf1458a350efb + .quad 0xc08625cc524adc58, 0xbe1ceeea1dadce12 + .quad 0xc08625cf332b68b0, 0xbe1cf0a1bfdc44c7 + .quad 0xc08625d2130332d0, 0xbe1cef96d02da73e + .quad 0xc08625d4f1d2f8a8, 0xbe1cf2451c3c7701 + .quad 0xc08625d7cf9b7778, 0xbe1cf10d08f83812 + .quad 0xc08625daac5d6ba0, 0xbe1ceec5b4895c5e + .quad 0xc08625dd881990b0, 0xbe1cf14e1325c5e4 + .quad 0xc08625e062d0a188, 0xbe1cf21d0904be12 + .quad 0xc08625e33c835838, 0xbe1ceed0839bcf21 + .quad 0xc08625e615326df0, 0xbe1cf1bb944889d2 + .quad 0xc08625e8ecde9b48, 0xbe1cee738e85eece + .quad 0xc08625ebc38897e0, 0xbe1cf25c2bc6ef12 + .quad 0xc08625ee99311ac8, 0xbe1cf132b70a41ad + .quad 0xc08625f16dd8da28, 0xbe1cf1984236a6e3 + .quad 0xc08625f441808b78, 0xbe1cf19ae74998f9 + .quad 0xc08625f71428e370, 0xbe1cef3e175d61a1 + .quad 0xc08625f9e5d295f8, 0xbe1cf101f9868fd9 + .quad 0xc08625fcb67e5658, 0xbe1cee69db83dcd2 + .quad 0xc08625ff862cd6f8, 
0xbe1cf081b636af51 + .quad 0xc086260254dec9a8, 0xbe1cee62c7d59b3e + .quad 0xc08626052294df58, 0xbe1cf1b745c57716 + .quad 0xc0862607ef4fc868, 0xbe1cef3d2800ea23 + .quad 0xc086260abb103458, 0xbe1cef480ff1acd2 + .quad 0xc086260d85d6d200, 0xbe1cf2424c9a17ef + .quad 0xc08626104fa44f90, 0xbe1cf12cfde90fd5 + .quad 0xc086261318795a68, 0xbe1cf21f590dd5b6 + .quad 0xc0862615e0569f48, 0xbe1cf0c50f9cd28a + .quad 0xc0862618a73cca30, 0xbe1ceedbdb520545 + .quad 0xc086261b6d2c8668, 0xbe1cf0b030396011 + .quad 0xc086261e32267e98, 0xbe1cf19917010e96 + .quad 0xc0862620f62b5cb0, 0xbe1cf07331355985 + .quad 0xc0862623b93bc9e8, 0xbe1cf01ae921a1c3 + .quad 0xc08626267b586ed0, 0xbe1cefe5cf0dbf0c + .quad 0xc08626293c81f348, 0xbe1cf01b258aeb50 + .quad 0xc086262bfcb8fe88, 0xbe1cee6b9e7f4c68 + .quad 0xc086262ebbfe3710, 0xbe1cee684a9b21c9 + .quad 0xc08626317a5242b8, 0xbe1cf1f8bcde9a8b + .quad 0xc086263437b5c6c0, 0xbe1cf1d063d36238 + .quad 0xc0862636f42967a8, 0xbe1cf1e31a19075e + .quad 0xc0862639afadc950, 0xbe1cf1d8efdf7e7d + .quad 0xc086263c6a438ef0, 0xbe1cf1812ee72dba + .quad 0xc086263f23eb5b18, 0xbe1cf1449a9a2279 + .quad 0xc0862641dca5cfb8, 0xbe1cee96edce5085 + .quad 0xc086264494738e08, 0xbe1cf06797bd03b2 + .quad 0xc08626474b5536b8, 0xbe1cef91b9b7ffc1 + .quad 0xc086264a014b69c0, 0xbe1cef4b6721278f + .quad 0xc086264cb656c678, 0xbe1cf1942925eb4a + .quad 0xc086264f6a77eba8, 0xbe1cefa2c7bc2e39 + .quad 0xc08626521daf7758, 0xbe1cf252595aceb3 + .quad 0xc0862654cffe0718, 0xbe1cee8e9ae47ec2 + .quad 0xc0862657816437a8, 0xbe1cf1bf913828fa + .quad 0xc086265a31e2a558, 0xbe1cf23475d6b366 + .quad 0xc086265ce179ebc8, 0xbe1cef8df00a922b + .quad 0xc086265f902aa5f0, 0xbe1cef279bfa43e0 + .quad 0xc08626623df56e38, 0xbe1cf080e10b8365 + .quad 0xc0862664eadade70, 0xbe1cf1a518f9b544 + .quad 0xc086266796db8fd0, 0xbe1cef9308fed9e9 + .quad 0xc086266a41f81ae8, 0xbe1ceea3ae6b19c9 + .quad 0xc086266cec3117b8, 0xbe1ceef06003d4c2 + .quad 0xc086266f95871da8, 0xbe1cf0b8457ffb0c + .quad 0xc08626723dfac390, 0xbe1cf0c526745ad6 + 
.quad 0xc0862674e58c9fa8, 0xbe1cf0cf91ff7b5d + .quad 0xc08626778c3d4798, 0xbe1cefe260819380 + .quad 0xc086267a320d5070, 0xbe1ceebd90aa27a3 + .quad 0xc086267cd6fd4ea8, 0xbe1cf0388121dffa + .quad 0xc086267f7b0dd630, 0xbe1cf1a3881435f1 + .quad 0xc08626821e3f7a68, 0xbe1cef28e9d9ac52 + .quad 0xc0862684c092ce08, 0xbe1cf02d300062dd + .quad 0xc086268762086350, 0xbe1cefaee1edfa35 + .quad 0xc086268a02a0cbe0, 0xbe1cf0a5a052e936 + .quad 0xc086268ca25c98d8, 0xbe1cee60a4a497ed + .quad 0xc086268f413c5ab0, 0xbe1cf0e4a5d0cf49 + .quad 0xc0862691df40a170, 0xbe1cf149235a4e6e + .quad 0xc08626947c69fc80, 0xbe1cf215180b9fcc + .quad 0xc086269718b8fac8, 0xbe1cef9b156a9840 + .quad 0xc0862699b42e2a90, 0xbe1cf054c91441be + .quad 0xc086269c4eca19a8, 0xbe1cf13ded26512c + .quad 0xc086269ee88d5550, 0xbe1cf22ea4d8ac06 + .quad 0xc08626a181786a40, 0xbe1cf2354666ee2e + .quad 0xc08626a4198be4a8, 0xbe1cefef936752b3 + .quad 0xc08626a6b0c85020, 0xbe1cf1e360a9db68 + .quad 0xc08626a9472e37d8, 0xbe1ceed6aeb812c5 + .quad 0xc08626abdcbe2650, 0xbe1cf227340b4986 + .quad 0xc08626ae7178a5b0, 0xbe1cf0215a0cbe0d + .quad 0xc08626b1055e3f70, 0xbe1cf256adf0ae26 + .quad 0xc08626b3986f7ca8, 0xbe1ceff3c67aed06 + .quad 0xc08626b62aace5c8, 0xbe1cf2159fb93652 + .quad 0xc08626b8bc1702e0, 0xbe1cf01e6dbd1c7f + .quad 0xc08626bb4cae5b60, 0xbe1cf009e75d1c0c + .quad 0xc08626bddc737648, 0xbe1ceec10a020e73 + .quad 0xc08626c06b66da08, 0xbe1cf06d5783eee7 + .quad 0xc08626c2f9890ca0, 0xbe1cf0cb8f169ffe + .quad 0xc08626c586da9388, 0xbe1cef7de2452430 + .quad 0xc08626c8135bf3b0, 0xbe1cf05da6f783ae + .quad 0xc08626ca9f0db198, 0xbe1cefcc877d681d + .quad 0xc08626cd29f05138, 0xbe1cef0531954ab3 + .quad 0xc08626cfb4045608, 0xbe1cf06b8565ea3d + .quad 0xc08626d23d4a4310, 0xbe1cefdc455d9d7e + .quad 0xc08626d4c5c29ad0, 0xbe1ceefc47e8fa64 + .quad 0xc08626d74d6ddf48, 0xbe1cf1872bf033f2 + .quad 0xc08626d9d44c9210, 0xbe1cf19d91087f9d + .quad 0xc08626dc5a5f3438, 0xbe1cf012d444c6ab + .quad 0xc08626dedfa64650, 0xbe1cf0ba528ee153 + .quad 0xc08626e164224880, 
0xbe1ceeb431709788 + .quad 0xc08626e3e7d3ba60, 0xbe1cf0b9af31a6a5 + .quad 0xc08626e66abb1b28, 0xbe1cf168fb2e135b + .quad 0xc08626e8ecd8e990, 0xbe1cef9097461c93 + .quad 0xc08626eb6e2da3d0, 0xbe1cee7a434735d8 + .quad 0xc08626edeeb9c7a8, 0xbe1cf235732b86f2 + .quad 0xc08626f06e7dd280, 0xbe1cefe1510b89e6 + .quad 0xc08626f2ed7a4120, 0xbe1cf1f64b9b80ef + .quad 0xc08626f56baf9000, 0xbe1cf08f320ca339 + .quad 0xc08626f7e91e3b08, 0xbe1cf1b1de2808a1 + .quad 0xc08626fa65c6bdc0, 0xbe1cf1976d778b28 + .quad 0xc08626fce1a99338, 0xbe1ceef40a4f076f + .quad 0xc08626ff5cc73600, 0xbe1cef3e45869ce3 + .quad 0xc0862701d7202048, 0xbe1ceef601b4c9d6 + .quad 0xc086270450b4cbc0, 0xbe1cf1eaf0b57fd6 + .quad 0xc0862706c985b1c0, 0xbe1cef82a44990f3 + .quad 0xc086270941934b10, 0xbe1ceefe32981f2c + .quad 0xc086270bb8de1018, 0xbe1cefbf6f5a0445 + .quad 0xc086270e2f6678d0, 0xbe1cf18dba75792c + .quad 0xc0862710a52cfcc8, 0xbe1cf0da64ce995f + .quad 0xc08627131a321318, 0xbe1cef04ac0fb802 + .quad 0xc08627158e763268, 0xbe1cee9d4e2ad9bd + .quad 0xc086271801f9d0f8, 0xbe1cefa9b55407b5 + .quad 0xc086271a74bd64a0, 0xbe1cefe6bd329570 + .quad 0xc086271ce6c162c8, 0xbe1cef0b1205dc85 + .quad 0xc086271f58064068, 0xbe1cef092a785e3f + .quad 0xc0862721c88c7210, 0xbe1cf050dcdaac30 + .quad 0xc086272438546be8, 0xbe1cf210907ded8b + .quad 0xc0862726a75ea1b8, 0xbe1cee760be44f99 + .quad 0xc086272915ab86c0, 0xbe1ceeeee07c2bcc + .quad 0xc086272b833b8df0, 0xbe1cf06874992df5 + .quad 0xc086272df00f29d0, 0xbe1cef8fac5d4899 + .quad 0xc08627305c26cc70, 0xbe1cf1103241cc99 + .quad 0xc0862732c782e788, 0xbe1cf1d35fef83fe + .quad 0xc08627353223ec68, 0xbe1cef3ec8133e1d + .quad 0xc08627379c0a4be8, 0xbe1cef7261daccd8 + .quad 0xc086273a05367688, 0xbe1cf18656c50806 + .quad 0xc086273c6da8dc68, 0xbe1cf1c8736e049a + .quad 0xc086273ed561ed38, 0xbe1cf1f93bff4911 + .quad 0xc08627413c621848, 0xbe1cf188a4ea680c + .quad 0xc0862743a2a9cc80, 0xbe1cf1d270930c80 + .quad 0xc086274608397868, 0xbe1cf25a328c28e2 + .quad 0xc08627486d118a28, 0xbe1cf106f90aa3b8 + 
.quad 0xc086274ad1326f80, 0xbe1cee5e9d2e885a + .quad 0xc086274d349c95c0, 0xbe1cf1c0bac27228 + .quad 0xc086274f975069f8, 0xbe1cf1a1500f9b1c + .quad 0xc0862751f94e58c0, 0xbe1cefc30663ac44 + .quad 0xc08627545a96ce48, 0xbe1cf17123e427a2 + .quad 0xc0862756bb2a3678, 0xbe1cefb92749fea4 + .quad 0xc08627591b08fcc0, 0xbe1cefa40e1ea74a + .quad 0xc086275b7a338c40, 0xbe1cee6f4612c3e9 + .quad 0xc086275dd8aa4fa8, 0xbe1cf1c54a053627 + .quad 0xc0862760366db168, 0xbe1ceff5eb503d9e + .quad 0xc0862762937e1b70, 0xbe1cf02e47f10cee + .quad 0xc0862764efdbf768, 0xbe1ceeb06e1d0dad + .quad 0xc08627674b87ae88, 0xbe1cf10aadd6dba5 + .quad 0xc0862769a681a9c0, 0xbe1cf24e9913d30f + .quad 0xc086276c00ca51a0, 0xbe1cef47b301e312 + .quad 0xc086276e5a620e48, 0xbe1ceeb1cefc2e85 + .quad 0xc0862770b3494788, 0xbe1cf16f1fbbe011 + .quad 0xc08627730b8064e8, 0xbe1ceebdf75174c7 + .quad 0xc08627756307cd70, 0xbe1cf06e3871a0da + .quad 0xc0862777b9dfe7f0, 0xbe1cef16799fd554 + .quad 0xc086277a10091ac0, 0xbe1cf248dabf5377 + .quad 0xc086277c6583cc00, 0xbe1cf0c78d92a2cd + .quad 0xc086277eba506158, 0xbe1cf0b911b029f0 + .quad 0xc08627810e6f4028, 0xbe1cefdc24719766 + .quad 0xc086278361e0cd70, 0xbe1cefbb6562b7e7 + .quad 0xc0862785b4a56dd8, 0xbe1cf1e0afb349ec + .quad 0xc086278806bd85c0, 0xbe1cf008292e52fc + .quad 0xc086278a58297918, 0xbe1cf053073872bf + .quad 0xc086278ca8e9ab88, 0xbe1cf17a0a55a947 + .quad 0xc086278ef8fe8068, 0xbe1ceeffb0b60234 + .quad 0xc086279148685aa0, 0xbe1cf162204794a8 + .quad 0xc086279397279ce0, 0xbe1cf24cc8cb48ac + .quad 0xc0862795e53ca978, 0xbe1cf0c9be68d5c3 + .quad 0xc086279832a7e258, 0xbe1cf172cd3d7388 + .quad 0xc086279a7f69a930, 0xbe1ceea2465fbce5 + .quad 0xc086279ccb825f40, 0xbe1cf0a386d2500f + .quad 0xc086279f16f26590, 0xbe1cf1e338ddc18a + .quad 0xc08627a161ba1cd0, 0xbe1cef1f5049867f + .quad 0xc08627a3abd9e548, 0xbe1cef96c1ea8b1f + .quad 0xc08627a5f5521f00, 0xbe1cf138f6fd3c26 + .quad 0xc08627a83e2329b0, 0xbe1cf0d4fcbfdf3a + .quad 0xc08627aa864d64b0, 0xbe1cf24870c12c81 + .quad 0xc08627accdd12f18, 
0xbe1cf0ae2a56348d + .quad 0xc08627af14aee7a0, 0xbe1cee8ca1a9b893 + .quad 0xc08627b15ae6eca8, 0xbe1cf20414d637b0 + .quad 0xc08627b3a0799c60, 0xbe1cf0fc6b7b12d8 + .quad 0xc08627b5e5675488, 0xbe1cf152d93c4a00 + .quad 0xc08627b829b072a0, 0xbe1cf1073f9b77c2 + .quad 0xc08627ba6d5553d8, 0xbe1cee694f97d5a4 + .quad 0xc08627bcb0565500, 0xbe1cf0456b8239d7 + .quad 0xc08627bef2b3d2b0, 0xbe1cf211497127e3 + .quad 0xc08627c1346e2930, 0xbe1cf01856c0384d + .quad 0xc08627c37585b468, 0xbe1cefa7dd05479e + .quad 0xc08627c5b5fad000, 0xbe1cef3ae8e50b93 + .quad 0xc08627c7f5cdd750, 0xbe1ceea5f32fdd3a + .quad 0xc08627ca34ff2560, 0xbe1cef424caeb8d9 + .quad 0xc08627cc738f14f0, 0xbe1cf0194d07a81f + .quad 0xc08627ceb17e0070, 0xbe1cf20f452000c1 + .quad 0xc08627d0eecc4210, 0xbe1cf00e356218e4 + .quad 0xc08627d32b7a33a0, 0xbe1cef30484b4bcb + .quad 0xc08627d567882eb0, 0xbe1ceeea11a6641b + .quad 0xc08627d7a2f68c80, 0xbe1cf13492d5bd7b + .quad 0xc08627d9ddc5a618, 0xbe1ceeb7048fad96 + .quad 0xc08627dc17f5d418, 0xbe1ceef0666f0477 + .quad 0xc08627de51876ee8, 0xbe1cf060d4b8b5c2 + .quad 0xc08627e08a7acea8, 0xbe1cf0b2a4b6ff8c + .quad 0xc08627e2c2d04b28, 0xbe1cf0e34809a875 + .quad 0xc08627e4fa883bf0, 0xbe1cf16bf74a3522 + .quad 0xc08627e731a2f848, 0xbe1cee6a24623d57 + .quad 0xc08627e96820d718, 0xbe1cefc7b4f1528e + .quad 0xc08627eb9e022f18, 0xbe1cf163051f3548 + .quad 0xc08627edd34756b8, 0xbe1cef36b3366305 + .quad 0xc08627f007f0a408, 0xbe1cf18134625550 + .quad 0xc08627f23bfe6cf0, 0xbe1cf0ec32ec1a11 + .quad 0xc08627f46f710700, 0xbe1ceeb3b64f3edc + .quad 0xc08627f6a248c778, 0xbe1cf0cd15805bc8 + .quad 0xc08627f8d4860368, 0xbe1cf20db3bddebe + .quad 0xc08627fb06290f90, 0xbe1cf25188430e25 + .quad 0xc08627fd37324070, 0xbe1ceea1713490f9 + .quad 0xc08627ff67a1ea28, 0xbe1cf159521d234c + .quad 0xc0862801977860b8, 0xbe1cf24dfe50783b + .quad 0xc0862803c6b5f7d0, 0xbe1ceef2ef89a60b + .quad 0xc0862805f55b02c8, 0xbe1cee7fc919d62c + .quad 0xc08628082367d4c0, 0xbe1cf215a7fb513a + .quad 0xc086280a50dcc0a8, 0xbe1cf0e4401c5ed4 + 
.quad 0xc086280c7dba1910, 0xbe1cf04ec734d256 + .quad 0xc086280eaa003050, 0xbe1cf010ad787fea + .quad 0xc0862810d5af5880, 0xbe1cee622478393d + .quad 0xc086281300c7e368, 0xbe1cf01c7482564f + .quad 0xc08628152b4a22a0, 0xbe1cf0de20d33536 + .quad 0xc086281755366778, 0xbe1cef2edae5837d + .quad 0xc08628197e8d02f0, 0xbe1cf0a345318cc9 + .quad 0xc086281ba74e45d8, 0xbe1cf20085aa34b8 + .quad 0xc086281dcf7a80c0, 0xbe1cef5fa845ad83 + .quad 0xc086281ff71203e0, 0xbe1cf050d1df69c4 + .quad 0xc08628221e151f48, 0xbe1ceffe43c035b9 + .quad 0xc0862824448422b8, 0xbe1cf14f3018d3c2 + .quad 0xc08628266a5f5dc0, 0xbe1cef0a5fbae83d + .quad 0xc08628288fa71f98, 0xbe1ceff8a95b72a1 + .quad 0xc086282ab45bb750, 0xbe1cef073aa9849b + .quad 0xc086282cd87d73a8, 0xbe1cef69b3835c02 + .quad 0xc086282efc0ca328, 0xbe1cf0bc139379a9 + .quad 0xc08628311f099420, 0xbe1cef247a9ec596 + .quad 0xc086283341749490, 0xbe1cef74bbcc488a + .quad 0xc0862835634df248, 0xbe1cef4bc42e7b8e + .quad 0xc08628378495fad0, 0xbe1cf136d4d5a810 + .quad 0xc0862839a54cfb80, 0xbe1cf0d290b24dd8 + .quad 0xc086283bc5734168, 0xbe1ceeebde8e0065 + .quad 0xc086283de5091950, 0xbe1cf1a09f60aa1e + .quad 0xc0862840040ecfe0, 0xbe1cf0803947a234 + .quad 0xc08628422284b168, 0xbe1cf0abf7638127 + .quad 0xc0862844406b0a08, 0xbe1cf0f73ee12058 + .quad 0xc08628465dc225a0, 0xbe1cf2079971b26c + .quad 0xc08628487a8a4fe0, 0xbe1cee74957564b1 + .quad 0xc086284a96c3d420, 0xbe1ceee77c1b7d43 + .quad 0xc086284cb26efd90, 0xbe1cf23addba6e09 + .quad 0xc086284ecd8c1730, 0xbe1cf199f4a1da60 + .quad 0xc0862850e81b6bb0, 0xbe1cf09fdea81393 + .quad 0xc0862853021d4588, 0xbe1cf176adb417f7 + .quad 0xc08628551b91ef00, 0xbe1cf0f64f84a8da + .quad 0xc08628573479b220, 0xbe1ceec34cf49523 + .quad 0xc08628594cd4d8a8, 0xbe1cf16d60fbe0bb + .quad 0xc086285b64a3ac40, 0xbe1cee8de7acfc7b + .quad 0xc086285d7be67630, 0xbe1ceee6256cce8d + .quad 0xc086285f929d7fa0, 0xbe1cee7d66a3d8a5 + .quad 0xc0862861a8c91170, 0xbe1cf0bef8265792 + .quad 0xc0862863be697458, 0xbe1cf097f890c6f8 + .quad 0xc0862865d37ef0c8, 
0xbe1cf09502d5c3fc + .quad 0xc0862867e809cf00, 0xbe1ceeffb239dac7 + .quad 0xc0862869fc0a56f8, 0xbe1cf1fbfff95c98 + .quad 0xc086286c0f80d090, 0xbe1cefa57ad3eef7 + .quad 0xc086286e226d8348, 0xbe1cf22c58b9183d + .quad 0xc086287034d0b690, 0xbe1ceff262d0a248 + .quad 0xc086287246aab180, 0xbe1cefa7bc194186 + .quad 0xc086287457fbbb08, 0xbe1cf06782d784d9 + .quad 0xc086287668c419e0, 0xbe1cf1d44d0eaa07 + .quad 0xc086287879041490, 0xbe1cf034803c8a48 + .quad 0xc086287a88bbf158, 0xbe1cf08e84916b6f + .quad 0xc086287c97ebf650, 0xbe1cf0c4d3dc1bc7 + .quad 0xc086287ea6946958, 0xbe1cefb1e4625943 + .quad 0xc0862880b4b59010, 0xbe1cf143efdd1fd0 + .quad 0xc0862882c24faff8, 0xbe1cee9896d016da + .quad 0xc0862884cf630e38, 0xbe1cf2186072f2cc + .quad 0xc0862886dbefeff0, 0xbe1cef9217633d34 + .quad 0xc0862888e7f699e0, 0xbe1cf05603549486 + .quad 0xc086288af37750b0, 0xbe1cef50fff513d3 + .quad 0xc086288cfe7258c0, 0xbe1cf127713b32d0 + .quad 0xc086288f08e7f650, 0xbe1cf05015520f3d + .quad 0xc086289112d86d58, 0xbe1cf12eb458b26f + .quad 0xc08628931c4401a8, 0xbe1cf22eae2887ed + .quad 0xc0862895252af6e0, 0xbe1cefdd6656dd2d + .quad 0xc08628972d8d9058, 0xbe1cf1048ea4e646 + .quad 0xc0862899356c1150, 0xbe1ceec4501167e9 + .quad 0xc086289b3cc6bcb8, 0xbe1cf0ad52becc3f + .quad 0xc086289d439dd568, 0xbe1cf0daa4e00e35 + .quad 0xc086289f49f19df8, 0xbe1cf00b80de8d6a + .quad 0xc08628a14fc258c8, 0xbe1cf1bcf2ea8464 + .quad 0xc08628a355104818, 0xbe1cf0435e2782b0 + .quad 0xc08628a559dbade0, 0xbe1cf0e3e1a5f56c + .quad 0xc08628a75e24cbf8, 0xbe1cefed9d5a721d + .quad 0xc08628a961ebe3f8, 0xbe1cf0d2d74321e2 + .quad 0xc08628ab65313750, 0xbe1cf24200eb55e9 + .quad 0xc08628ad67f50740, 0xbe1cf23e9d7cf979 + .quad 0xc08628af6a3794d0, 0xbe1cf23a088f421c + .quad 0xc08628b16bf920e0, 0xbe1cef2c1de1ab32 + .quad 0xc08628b36d39ec08, 0xbe1cf1abc231f7b2 + .quad 0xc08628b56dfa36d0, 0xbe1cf2074d5ba303 + .quad 0xc08628b76e3a4180, 0xbe1cf05cd5eed880 + /*== Log_LA_table ==*/ + .align 32 + .quad 0x8000000000000000 + .quad 0xbf5ff802a9ab10e6 + .quad 
0xbf6ff00aa2b10bc0 + .quad 0xbf77ee11ebd82e94 + .quad 0xbf7fe02a6b106789 + .quad 0xbf83e7295d25a7d9 + .quad 0xbf87dc475f810a77 + .quad 0xbf8bcf712c74384c + .quad 0xbf8fc0a8b0fc03e4 + .quad 0xbf91d7f7eb9eebe7 + .quad 0xbf93cea44346a575 + .quad 0xbf95c45a51b8d389 + .quad 0xbf97b91b07d5b11b + .quad 0xbf99ace7551cc514 + .quad 0xbf9b9fc027af9198 + .quad 0xbf9d91a66c543cc4 + .quad 0xbf9f829b0e783300 + .quad 0xbfa0b94f7c196176 + .quad 0xbfa1b0d98923d980 + .quad 0xbfa2a7ec2214e873 + .quad 0xbfa39e87b9febd60 + .quad 0xbfa494acc34d911c + .quad 0xbfa58a5bafc8e4d5 + .quad 0xbfa67f94f094bd98 + .quad 0xbfa77458f632dcfc + .quad 0xbfa868a83083f6cf + .quad 0xbfa95c830ec8e3eb + .quad 0xbfaa4fe9ffa3d235 + .quad 0xbfab42dd711971bf + .quad 0xbfac355dd0921f2d + .quad 0xbfad276b8adb0b52 + .quad 0xbfae19070c276016 + .quad 0xbfaf0a30c01162a6 + .quad 0xbfaffae9119b9303 + .quad 0xbfb075983598e471 + .quad 0xbfb0ed839b5526fe + .quad 0xbfb16536eea37ae1 + .quad 0xbfb1dcb263db1944 + .quad 0xbfb253f62f0a1417 + .quad 0xbfb2cb0283f5de1f + .quad 0xbfb341d7961bd1d1 + .quad 0xbfb3b87598b1b6ee + .quad 0xbfb42edcbea646f0 + .quad 0xbfb4a50d3aa1b040 + .quad 0xbfb51b073f06183f + .quad 0xbfb590cafdf01c28 + .quad 0xbfb60658a93750c4 + .quad 0xbfb67bb0726ec0fc + .quad 0xbfb6f0d28ae56b4c + .quad 0xbfb765bf23a6be13 + .quad 0xbfb7da766d7b12cd + .quad 0xbfb84ef898e8282a + .quad 0xbfb8c345d6319b21 + .quad 0xbfb9375e55595ede + .quad 0xbfb9ab42462033ad + .quad 0xbfba1ef1d8061cd4 + .quad 0xbfba926d3a4ad563 + .quad 0xbfbb05b49bee43fe + .quad 0xbfbb78c82bb0eda1 + .quad 0xbfbbeba818146765 + .quad 0xbfbc5e548f5bc743 + .quad 0xbfbcd0cdbf8c13e1 + .quad 0xbfbd4313d66cb35d + .quad 0xbfbdb5270187d927 + .quad 0xbfbe27076e2af2e6 + .quad 0xbfbe98b549671467 + .quad 0xbfbf0a30c01162a6 + .quad 0xbfbf7b79fec37ddf + .quad 0xbfbfec9131dbeabb + .quad 0xbfc02ebb42bf3d4b + .quad 0xbfc0671512ca596e + .quad 0xbfc09f561ee719c3 + .quad 0xbfc0d77e7cd08e59 + .quad 0xbfc10f8e422539b1 + .quad 0xbfc14785846742ac + .quad 0xbfc17f6458fca611 + .quad 
0xbfc1b72ad52f67a0 + .quad 0xbfc1eed90e2dc2c3 + .quad 0xbfc2266f190a5acb + .quad 0xbfc25ded0abc6ad2 + .quad 0xbfc29552f81ff523 + .quad 0xbfc2cca0f5f5f251 + .quad 0xbfc303d718e47fd3 + .quad 0xbfc33af575770e4f + .quad 0xbfc371fc201e8f74 + .quad 0xbfc3a8eb2d31a376 + .quad 0xbfc3dfc2b0ecc62a + .quad 0xbfc41682bf727bc0 + .quad 0xbfc44d2b6ccb7d1e + .quad 0xbfc483bccce6e3dd + .quad 0xbfc4ba36f39a55e5 + .quad 0xbfc4f099f4a230b2 + .quad 0xbfc526e5e3a1b438 + .quad 0xbfc55d1ad4232d6f + .quad 0xbfc59338d9982086 + .quad 0xbfc5c940075972b9 + .quad 0xbfc5ff3070a793d4 + .quad 0xbfc6350a28aaa758 + .quad 0xbfc66acd4272ad51 + .quad 0xbfc6a079d0f7aad2 + .quad 0xbfc6d60fe719d21d + .quad 0xbfc70b8f97a1aa75 + .quad 0xbfc740f8f54037a5 + .quad 0xbfc7764c128f2127 + .quad 0xbfc7ab890210d909 + .quad 0xbfc7e0afd630c274 + .quad 0xbfc815c0a14357eb + .quad 0xbfc84abb75865139 + .quad 0xbfc87fa06520c911 + .quad 0xbfc8b46f8223625b + .quad 0xbfc8e928de886d41 + .quad 0xbfc91dcc8c340bde + .quad 0xbfc9525a9cf456b4 + .quad 0xbfc986d3228180ca + .quad 0xbfc9bb362e7dfb83 + .quad 0xbfc9ef83d2769a34 + .quad 0xbfca23bc1fe2b563 + .quad 0xbfca57df28244dcd + .quad 0xbfca8becfc882f19 + .quad 0xbfcabfe5ae46124c + .quad 0xbfcaf3c94e80bff3 + .quad 0xbfcb2797ee46320c + .quad 0xbfcb5b519e8fb5a4 + .quad 0xbfcb8ef670420c3b + .quad 0xbfcbc286742d8cd6 + .quad 0xbfcbf601bb0e44e2 + .quad 0xbfcc2968558c18c1 + .quad 0xbfcc5cba543ae425 + .quad 0xbfcc8ff7c79a9a22 + .quad 0xbfccc320c0176502 + .quad 0xbfccf6354e09c5dc + .quad 0xbfcd293581b6b3e7 + .quad 0xbfcd5c216b4fbb91 + .quad 0xbfcd8ef91af31d5e + .quad 0xbfcdc1bca0abec7d + .quad 0xbfcdf46c0c722d2f + .quad 0xbfce27076e2af2e6 + .quad 0xbfce598ed5a87e2f + .quad 0xbfce8c0252aa5a60 + .quad 0xbfcebe61f4dd7b0b + .quad 0xbfcef0adcbdc5936 + .quad 0xbfcf22e5e72f105d + .quad 0xbfcf550a564b7b37 + .quad 0xbfcf871b28955045 + .quad 0xbfcfb9186d5e3e2b + .quad 0xbfcfeb0233e607cc + .quad 0xbfd00e6c45ad501d + .quad 0xbfd0274dc16c232f + .quad 0xbfd0402594b4d041 + .quad 0xbfd058f3c703ebc6 + .quad 
0xbfd071b85fcd590d + .quad 0xbfd08a73667c57af + .quad 0xbfd0a324e27390e3 + .quad 0xbfd0bbccdb0d24bd + .quad 0xbfd0d46b579ab74b + .quad 0xbfd0ed005f657da4 + .quad 0xbfd1058bf9ae4ad5 + .quad 0xbfd11e0e2dad9cb7 + .quad 0xbfd136870293a8b0 + .quad 0xbfd14ef67f88685a + .quad 0xbfd1675cababa60e + .quad 0xbfd17fb98e15095d + .quad 0xbfd1980d2dd4236f + .quad 0xbfd1b05791f07b49 + .quad 0xbfd1c898c16999fb + .quad 0xbfd1e0d0c33716be + .quad 0xbfd1f8ff9e48a2f3 + .quad 0xbfd211255986160c + .quad 0xbfd22941fbcf7966 + .quad 0xbfd241558bfd1404 + .quad 0xbfd2596010df763a + .quad 0xbfd27161913f853d + .quad 0xbfd2895a13de86a3 + .quad 0xbfd2a1499f762bc9 + .quad 0xbfd2b9303ab89d25 + .quad 0xbfd2d10dec508583 + .quad 0xbfd2e8e2bae11d31 + .quad 0xbfd300aead06350c + .quad 0xbfd31871c9544185 + .quad 0xbfd3302c16586588 + .quad 0xbfd347dd9a987d55 + .quad 0xbfd35f865c93293e + .quad 0xbfd3772662bfd85b + .quad 0xbfd38ebdb38ed321 + .quad 0xbfd3a64c556945ea + .quad 0xbfd3bdd24eb14b6a + .quad 0xbfd3d54fa5c1f710 + .quad 0xbfd3ecc460ef5f50 + .quad 0xbfd404308686a7e4 + .quad 0xbfd41b941cce0bee + .quad 0xbfd432ef2a04e814 + .quad 0xbfd44a41b463c47c + .quad 0xbfd4618bc21c5ec2 + .quad 0xbfd478cd5959b3d9 + .quad 0xbfd49006804009d1 + .quad 0xbfd4a7373cecf997 + .quad 0xbfd4be5f957778a1 + .quad 0xbfd4d57f8fefe27f + .quad 0xbfd4ec973260026a + .quad 0xbfd503a682cb1cb3 + .quad 0xbfd51aad872df82d + .quad 0xbfd531ac457ee77e + .quad 0xbfd548a2c3add263 + .quad 0xbfd55f9107a43ee2 + .quad 0xbfd5767717455a6c + .quad 0xbfd58d54f86e02f2 + .quad 0xbfd5a42ab0f4cfe2 + .quad 0xbfd5baf846aa1b19 + .quad 0xbfd5d1bdbf5809ca + .quad 0xbfd5e87b20c2954a + .quad 0xbfd5ff3070a793d4 + .quad 0xbfd615ddb4bec13c + .quad 0xbfd62c82f2b9c795 + .quad 0x3fd61965cdb02c1f + .quad 0x3fd602d08af091ec + .quad 0x3fd5ec433d5c35ae + .quad 0x3fd5d5bddf595f30 + .quad 0x3fd5bf406b543db2 + .quad 0x3fd5a8cadbbedfa1 + .quad 0x3fd5925d2b112a59 + .quad 0x3fd57bf753c8d1fb + .quad 0x3fd565995069514c + .quad 0x3fd54f431b7be1a9 + .quad 0x3fd538f4af8f72fe + .quad 
0x3fd522ae0738a3d8 + .quad 0x3fd50c6f1d11b97c + .quad 0x3fd4f637ebba9810 + .quad 0x3fd4e0086dd8baca + .quad 0x3fd4c9e09e172c3c + .quad 0x3fd4b3c077267e9a + .quad 0x3fd49da7f3bcc41f + .quad 0x3fd487970e958770 + .quad 0x3fd4718dc271c41b + .quad 0x3fd45b8c0a17df13 + .quad 0x3fd44591e0539f49 + .quad 0x3fd42f9f3ff62642 + .quad 0x3fd419b423d5e8c7 + .quad 0x3fd403d086cea79c + .quad 0x3fd3edf463c1683e + .quad 0x3fd3d81fb5946dba + .quad 0x3fd3c25277333184 + .quad 0x3fd3ac8ca38e5c5f + .quad 0x3fd396ce359bbf54 + .quad 0x3fd3811728564cb2 + .quad 0x3fd36b6776be1117 + .quad 0x3fd355bf1bd82c8b + .quad 0x3fd3401e12aecba1 + .quad 0x3fd32a84565120a8 + .quad 0x3fd314f1e1d35ce4 + .quad 0x3fd2ff66b04ea9d4 + .quad 0x3fd2e9e2bce12286 + .quad 0x3fd2d46602adccee + .quad 0x3fd2bef07cdc9354 + .quad 0x3fd2a982269a3dbf + .quad 0x3fd2941afb186b7c + .quad 0x3fd27ebaf58d8c9d + .quad 0x3fd269621134db92 + .quad 0x3fd25410494e56c7 + .quad 0x3fd23ec5991eba49 + .quad 0x3fd22981fbef797b + .quad 0x3fd214456d0eb8d4 + .quad 0x3fd1ff0fe7cf47a7 + .quad 0x3fd1e9e1678899f4 + .quad 0x3fd1d4b9e796c245 + .quad 0x3fd1bf99635a6b95 + .quad 0x3fd1aa7fd638d33f + .quad 0x3fd1956d3b9bc2fa + .quad 0x3fd180618ef18adf + .quad 0x3fd16b5ccbacfb73 + .quad 0x3fd1565eed455fc3 + .quad 0x3fd14167ef367783 + .quad 0x3fd12c77cd00713b + .quad 0x3fd1178e8227e47c + .quad 0x3fd102ac0a35cc1c + .quad 0x3fd0edd060b78081 + .quad 0x3fd0d8fb813eb1ef + .quad 0x3fd0c42d676162e3 + .quad 0x3fd0af660eb9e279 + .quad 0x3fd09aa572e6c6d4 + .quad 0x3fd085eb8f8ae797 + .quad 0x3fd07138604d5862 + .quad 0x3fd05c8be0d9635a + .quad 0x3fd047e60cde83b8 + .quad 0x3fd03346e0106062 + .quad 0x3fd01eae5626c691 + .quad 0x3fd00a1c6adda473 + .quad 0x3fcfeb2233ea07cd + .quad 0x3fcfc218be620a5e + .quad 0x3fcf991c6cb3b379 + .quad 0x3fcf702d36777df0 + .quad 0x3fcf474b134df229 + .quad 0x3fcf1e75fadf9bde + .quad 0x3fcef5ade4dcffe6 + .quad 0x3fceccf2c8fe920a + .quad 0x3fcea4449f04aaf5 + .quad 0x3fce7ba35eb77e2a + .quad 0x3fce530effe71012 + .quad 0x3fce2a877a6b2c12 + .quad 
0x3fce020cc6235ab5 + .quad 0x3fcdd99edaf6d7e9 + .quad 0x3fcdb13db0d48940 + .quad 0x3fcd88e93fb2f450 + .quad 0x3fcd60a17f903515 + .quad 0x3fcd38666871f465 + .quad 0x3fcd1037f2655e7b + .quad 0x3fcce816157f1988 + .quad 0x3fccc000c9db3c52 + .quad 0x3fcc97f8079d44ec + .quad 0x3fcc6ffbc6f00f71 + .quad 0x3fcc480c0005ccd1 + .quad 0x3fcc2028ab17f9b4 + .quad 0x3fcbf851c067555f + .quad 0x3fcbd087383bd8ad + .quad 0x3fcba8c90ae4ad19 + .quad 0x3fcb811730b823d2 + .quad 0x3fcb5971a213acdb + .quad 0x3fcb31d8575bce3d + .quad 0x3fcb0a4b48fc1b46 + .quad 0x3fcae2ca6f672bd4 + .quad 0x3fcabb55c31693ad + .quad 0x3fca93ed3c8ad9e3 + .quad 0x3fca6c90d44b704e + .quad 0x3fca454082e6ab05 + .quad 0x3fca1dfc40f1b7f1 + .quad 0x3fc9f6c407089664 + .quad 0x3fc9cf97cdce0ec3 + .quad 0x3fc9a8778debaa38 + .quad 0x3fc981634011aa75 + .quad 0x3fc95a5adcf7017f + .quad 0x3fc9335e5d594989 + .quad 0x3fc90c6db9fcbcd9 + .quad 0x3fc8e588ebac2dbf + .quad 0x3fc8beafeb38fe8c + .quad 0x3fc897e2b17b19a5 + .quad 0x3fc871213750e994 + .quad 0x3fc84a6b759f512f + .quad 0x3fc823c16551a3c2 + .quad 0x3fc7fd22ff599d4f + .quad 0x3fc7d6903caf5ad0 + .quad 0x3fc7b0091651528c + .quad 0x3fc7898d85444c73 + .quad 0x3fc7631d82935a86 + .quad 0x3fc73cb9074fd14d + .quad 0x3fc716600c914054 + .quad 0x3fc6f0128b756abc + .quad 0x3fc6c9d07d203fc7 + .quad 0x3fc6a399dabbd383 + .quad 0x3fc67d6e9d785771 + .quad 0x3fc6574ebe8c133a + .quad 0x3fc6313a37335d76 + .quad 0x3fc60b3100b09476 + .quad 0x3fc5e533144c1719 + .quad 0x3fc5bf406b543db2 + .quad 0x3fc59958ff1d52f1 + .quad 0x3fc5737cc9018cdd + .quad 0x3fc54dabc26105d2 + .quad 0x3fc527e5e4a1b58d + .quad 0x3fc5022b292f6a45 + .quad 0x3fc4dc7b897bc1c8 + .quad 0x3fc4b6d6fefe22a4 + .quad 0x3fc4913d8333b561 + .quad 0x3fc46baf0f9f5db7 + .quad 0x3fc4462b9dc9b3dc + .quad 0x3fc420b32740fdd4 + .quad 0x3fc3fb45a59928cc + .quad 0x3fc3d5e3126bc27f + .quad 0x3fc3b08b6757f2a9 + .quad 0x3fc38b3e9e027479 + .quad 0x3fc365fcb0159016 + .quad 0x3fc340c59741142e + .quad 0x3fc31b994d3a4f85 + .quad 0x3fc2f677cbbc0a96 + .quad 
0x3fc2d1610c86813a + .quad 0x3fc2ac55095f5c59 + .quad 0x3fc28753bc11aba5 + .quad 0x3fc2625d1e6ddf57 + .quad 0x3fc23d712a49c202 + .quad 0x3fc2188fd9807263 + .quad 0x3fc1f3b925f25d41 + .quad 0x3fc1ceed09853752 + .quad 0x3fc1aa2b7e23f72a + .quad 0x3fc185747dbecf34 + .quad 0x3fc160c8024b27b1 + .quad 0x3fc13c2605c398c3 + .quad 0x3fc1178e8227e47c + .quad 0x3fc0f301717cf0fb + .quad 0x3fc0ce7ecdccc28d + .quad 0x3fc0aa06912675d5 + .quad 0x3fc08598b59e3a07 + .quad 0x3fc06135354d4b18 + .quad 0x3fc03cdc0a51ec0d + .quad 0x3fc0188d2ecf6140 + .quad 0x3fbfe89139dbd566 + .quad 0x3fbfa01c9db57ce2 + .quad 0x3fbf57bc7d9005db + .quad 0x3fbf0f70cdd992e3 + .quad 0x3fbec739830a1120 + .quad 0x3fbe7f1691a32d3e + .quad 0x3fbe3707ee30487b + .quad 0x3fbdef0d8d466db9 + .quad 0x3fbda727638446a2 + .quad 0x3fbd5f55659210e2 + .quad 0x3fbd179788219364 + .quad 0x3fbccfedbfee13a8 + .quad 0x3fbc885801bc4b23 + .quad 0x3fbc40d6425a5cb1 + .quad 0x3fbbf968769fca11 + .quad 0x3fbbb20e936d6974 + .quad 0x3fbb6ac88dad5b1c + .quad 0x3fbb23965a52ff00 + .quad 0x3fbadc77ee5aea8c + .quad 0x3fba956d3ecade63 + .quad 0x3fba4e7640b1bc38 + .quad 0x3fba0792e9277cac + .quad 0x3fb9c0c32d4d2548 + .quad 0x3fb97a07024cbe74 + .quad 0x3fb9335e5d594989 + .quad 0x3fb8ecc933aeb6e8 + .quad 0x3fb8a6477a91dc29 + .quad 0x3fb85fd927506a48 + .quad 0x3fb8197e2f40e3f0 + .quad 0x3fb7d33687c293c9 + .quad 0x3fb78d02263d82d3 + .quad 0x3fb746e100226ed9 + .quad 0x3fb700d30aeac0e1 + .quad 0x3fb6bad83c1883b6 + .quad 0x3fb674f089365a7a + .quad 0x3fb62f1be7d77743 + .quad 0x3fb5e95a4d9791cb + .quad 0x3fb5a3abb01ade25 + .quad 0x3fb55e10050e0384 + .quad 0x3fb518874226130a + .quad 0x3fb4d3115d207eac + .quad 0x3fb48dae4bc31018 + .quad 0x3fb4485e03dbdfad + .quad 0x3fb403207b414b7f + .quad 0x3fb3bdf5a7d1ee64 + .quad 0x3fb378dd7f749714 + .quad 0x3fb333d7f8183f4b + .quad 0x3fb2eee507b40301 + .quad 0x3fb2aa04a44717a5 + .quad 0x3fb26536c3d8c369 + .quad 0x3fb2207b5c78549e + .quad 0x3fb1dbd2643d190b + .quad 0x3fb1973bd1465567 + .quad 0x3fb152b799bb3cc9 + .quad 
0x3fb10e45b3cae831 + .quad 0x3fb0c9e615ac4e17 + .quad 0x3fb08598b59e3a07 + .quad 0x3fb0415d89e74444 + .quad 0x3faffa6911ab9301 + .quad 0x3faf723b517fc523 + .quad 0x3faeea31c006b87c + .quad 0x3fae624c4a0b5e1b + .quad 0x3fadda8adc67ee4e + .quad 0x3fad52ed6405d86f + .quad 0x3faccb73cdddb2cc + .quad 0x3fac441e06f72a9e + .quad 0x3fabbcebfc68f420 + .quad 0x3fab35dd9b58baad + .quad 0x3faaaef2d0fb10fc + .quad 0x3faa282b8a936171 + .quad 0x3fa9a187b573de7c + .quad 0x3fa91b073efd7314 + .quad 0x3fa894aa149fb343 + .quad 0x3fa80e7023d8ccc4 + .quad 0x3fa788595a3577ba + .quad 0x3fa70265a550e777 + .quad 0x3fa67c94f2d4bb58 + .quad 0x3fa5f6e73078efb8 + .quad 0x3fa5715c4c03ceef + .quad 0x3fa4ebf43349e26f + .quad 0x3fa466aed42de3ea + .quad 0x3fa3e18c1ca0ae92 + .quad 0x3fa35c8bfaa1306b + .quad 0x3fa2d7ae5c3c5bae + .quad 0x3fa252f32f8d183f + .quad 0x3fa1ce5a62bc353a + .quad 0x3fa149e3e4005a8d + .quad 0x3fa0c58fa19dfaaa + .quad 0x3fa0415d89e74444 + .quad 0x3f9f7a9b16782856 + .quad 0x3f9e72bf2813ce51 + .quad 0x3f9d6b2725979802 + .quad 0x3f9c63d2ec14aaf2 + .quad 0x3f9b5cc258b718e6 + .quad 0x3f9a55f548c5c43f + .quad 0x3f994f6b99a24475 + .quad 0x3f98492528c8cabf + .quad 0x3f974321d3d006d3 + .quad 0x3f963d6178690bd6 + .quad 0x3f9537e3f45f3565 + .quad 0x3f9432a925980cc1 + .quad 0x3f932db0ea132e22 + .quad 0x3f9228fb1fea2e28 + .quad 0x3f912487a5507f70 + .quad 0x3f90205658935847 + .quad 0x3f8e38ce3033310c + .quad 0x3f8c317384c75f06 + .quad 0x3f8a2a9c6c170462 + .quad 0x3f882448a388a2aa + .quad 0x3f861e77e8b53fc6 + .quad 0x3f841929f96832f0 + .quad 0x3f82145e939ef1e9 + .quad 0x3f8010157588de71 + .quad 0x3f7c189cbb0e27fb + .quad 0x3f78121214586b54 + .quad 0x3f740c8a747878e2 + .quad 0x3f70080559588b35 + .quad 0x3f680904828985c0 + .quad 0x3f60040155d5889e + .quad 0x3f50020055655889 + .quad 0x0000000000000000 + /*== poly_coeff[4] ==*/ + .align 32 + .quad 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A, 0x3fc9999CACDB4D0A /* coeff4 */ + .quad 0xbfd0000148058EE1, 0xbfd0000148058EE1, 
0xbfd0000148058EE1, 0xbfd0000148058EE1 /* coeff3 */ + .quad 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5, 0x3fd55555555543C5 /* coeff2 */ + .quad 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F, 0xbfdFFFFFFFFFF81F /* coeff1 */ + /*== ExpMask ==*/ + .align 32 + .quad 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff, 0x000fffffffffffff + /*== Two10 ==*/ + .align 32 + .quad 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000, 0x3f50000000000000 + /*== MinLog1p = -1+2^(-53) ==*/ + .align 32 + .quad 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff, 0xbfefffffffffffff + /*== MaxLog1p ==*/ + .align 32 + .quad 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000, 0x7f3ffffffffff000 + /*== One ==*/ + .align 32 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== SgnMask ==*/ + .align 32 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== XThreshold ==*/ + .align 32 + .quad 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000, 0x3e00000000000000 + /*== XhMask ==*/ + .align 32 + .quad 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00, 0xfffffffffffffc00 + /*== Threshold ==*/ + .align 32 + .quad 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000, 0x4086a00000000000 + /*== Bias ==*/ + .align 32 + .quad 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000, 0x408ff80000000000 + /*== Bias1 ==*/ + .align 32 + .quad 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000, 0x408ff00000000000 + /*== ExpMask ==*/ + .align 32 + .quad 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000, 0x7ff0000000000000 + /*== ExpMask2 ==*/ + .align 32 + .quad 0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000, 0x7f40000000000000 + /*== L2L ==*/ + .align 32 + .quad 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF, 0x3fe62E42FEFA39EF + /*== dBigThreshold ==*/ + .align 32 + .quad 0x41D0000000000000, 
0x41D0000000000000, 0x41D0000000000000, 0x41D0000000000000 + /*== dC2 ==*/ + .align 32 + .quad 0x3FD8000000000000, 0x3FD8000000000000, 0x3FD8000000000000, 0x3FD8000000000000 + /*== dC3 ==*/ + .align 32 + .quad 0x3FD4000000000000, 0x3FD4000000000000, 0x3FD4000000000000, 0x3FD4000000000000 + /*== dC4 ==*/ + .align 32 + .quad 0x3FD1800000000000, 0x3FD1800000000000, 0x3FD1800000000000, 0x3FD1800000000000 + /*== dC5 ==*/ + .align 32 + .quad 0x3FCF800000000000, 0x3FCF800000000000, 0x3FCF800000000000, 0x3FCF800000000000 + /*== dHalf ==*/ + .align 32 + .quad 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000, 0x3FE0000000000000 + /*== dLargestFinite ==*/ + .align 32 + .quad 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF, 0x7FEFFFFFFFFFFFFF + /*== dLittleThreshold ==*/ + .align 32 + .quad 0x3F60000000000000, 0x3F60000000000000, 0x3F60000000000000, 0x3F60000000000000 + /*== dSign ==*/ + .align 32 + .quad 0x8000000000000000, 0x8000000000000000, 0x8000000000000000, 0x8000000000000000 + /*== dThirtyOne ==*/ + .align 32 + .quad 0x403F000000000000, 0x403F000000000000, 0x403F000000000000, 0x403F000000000000 + /*== dTopMask12 ==*/ + .align 32 + .quad 0xFFFFFE0000000000, 0xFFFFFE0000000000, 0xFFFFFE0000000000, 0xFFFFFE0000000000 + /*== dTopMask29 ==*/ + .align 32 + .quad 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000, 0xFFFFFFFFFF000000 + /*== XScale ==*/ + .align 32 + .quad 0x3E10000000000000, 0x3E10000000000000, 0x3E10000000000000, 0x3E10000000000000 + .align 32 + .type __svml_dasinh_data_internal,@object + .size __svml_dasinh_data_internal,.-__svml_dasinh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core-avx2.S new file mode 100644 index 0000000000..647c73292c --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized asinh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define _ZGVeN8v_asinh _ZGVeN8v_asinh_avx2_wrapper +#include "../svml_d_asinh8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core.c new file mode 100644 index 0000000000..45e5ab72a6 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core.c @@ -0,0 +1,27 @@ +/* Multiple versions of vectorized asinh, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define SYMBOL_NAME _ZGVeN8v_asinh +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN8v_asinh, __GI__ZGVeN8v_asinh, __redirect__ZGVeN8v_asinh) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core_avx512.S new file mode 100644 index 0000000000..8100e8a50a --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_d_asinh8_core_avx512.S @@ -0,0 +1,510 @@ +/* Function asinh vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute asinh(x) as log(x + sqrt(x*x + 1)) + * using RSQRT instructions for starting the + * square root approximation, and small table lookups for log + * that map to AVX-512 permute instructions + * + * Special cases: + * + * asinh(NaN) = quiet NaN, and raise invalid exception + * asinh(INF) = that INF + * asinh(0) = that 0 + * + */ + +/* Offsets for data table __svml_dasinh_data_internal_avx512 + */ +#define Log_tbl_H 0 +#define Log_tbl_L 128 +#define One 256 +#define AbsMask 320 +#define SmallThreshold 384 +#define Threshold 448 +#define LargeThreshold 512 +#define ca2 576 +#define ca1 640 +#define c4s 704 +#define c3s 768 +#define c2s 832 +#define c1s 896 +#define AddB5 960 +#define RcpBitMask 1024 +#define OneEighth 1088 +#define Four 1152 +#define poly_coeff9 1216 +#define poly_coeff8 1280 +#define poly_coeff7 1344 +#define poly_coeff6 1408 +#define poly_coeff5 1472 +#define poly_coeff4 1536 +#define poly_coeff3 1600 +#define poly_coeff2 1664 +#define poly_coeff1 1728 +#define L2H 1792 +#define L2L 1856 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN8v_asinh_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovaps %zmm0, %zmm3 + +/* x^2 */ + vmulpd {rn-sae}, %zmm3, %zmm3, %zmm14 + vmovups One+__svml_dasinh_data_internal_avx512(%rip), %zmm9 + +/* polynomial computation for small inputs */ + vmovups ca2+__svml_dasinh_data_internal_avx512(%rip), %zmm10 + vmovups ca1+__svml_dasinh_data_internal_avx512(%rip), %zmm11 + +/* not a very small input ? 
*/ + vmovups SmallThreshold+__svml_dasinh_data_internal_avx512(%rip), %zmm0 + +/* A=max(x^2, 1); */ + vmaxpd {sae}, %zmm14, %zmm9, %zmm4 + +/* B=min(x^2, 1); */ + vminpd {sae}, %zmm14, %zmm9, %zmm5 + vfmadd231pd {rn-sae}, %zmm14, %zmm10, %zmm11 + +/* 1+x^2 */ + vaddpd {rn-sae}, %zmm9, %zmm14, %zmm8 + +/* |input| */ + vandpd AbsMask+__svml_dasinh_data_internal_avx512(%rip), %zmm3, %zmm1 + vrsqrt14pd %zmm8, %zmm6 + vcmppd $21, {sae}, %zmm0, %zmm1, %k2 + +/* B_high */ + vsubpd {rn-sae}, %zmm4, %zmm8, %zmm7 + +/* sign bit */ + vxorpd %zmm3, %zmm1, %zmm2 + vmulpd {rn-sae}, %zmm14, %zmm11, %zmm4 + +/* B_low */ + vsubpd {rn-sae}, %zmm7, %zmm5, %zmm13 + vmovups c2s+__svml_dasinh_data_internal_avx512(%rip), %zmm5 + vmovups c1s+__svml_dasinh_data_internal_avx512(%rip), %zmm7 + +/* polynomial computation for small inputs */ + vfmadd213pd {rn-sae}, %zmm1, %zmm1, %zmm4 + +/* (x^2)_low */ + vmovaps %zmm3, %zmm15 + vfmsub213pd {rn-sae}, %zmm14, %zmm3, %zmm15 + +/* Sh ~sqrt(1+x^2) */ + vmulpd {rn-sae}, %zmm6, %zmm8, %zmm14 + +/* Yl = (x^2)_low + B_low */ + vaddpd {rn-sae}, %zmm15, %zmm13, %zmm13 + +/* very large inputs ? */ + vmovups Threshold+__svml_dasinh_data_internal_avx512(%rip), %zmm15 + +/* (Yh*R0)_low */ + vfmsub213pd {rn-sae}, %zmm14, %zmm6, %zmm8 + vcmppd $21, {sae}, %zmm15, %zmm1, %k1 + +/* Sl = (Yh*R0)_low+(R0*Yl) */ + vfmadd213pd {rn-sae}, %zmm8, %zmm6, %zmm13 + vmovups LargeThreshold+__svml_dasinh_data_internal_avx512(%rip), %zmm8 + +/* rel. error term: Eh=1-Sh*R0 */ + vmovaps %zmm9, %zmm12 + vfnmadd231pd {rn-sae}, %zmm14, %zmm6, %zmm12 + vcmppd $22, {sae}, %zmm8, %zmm1, %k0 + +/* rel. 
error term: Eh=(1-Sh*R0)-Sl*R0 */ + vfnmadd231pd {rn-sae}, %zmm13, %zmm6, %zmm12 + +/* + * sqrt(1+x^2) ~ Sh + Sl + Sh*Eh*poly_s + * poly_s = c1+c2*Eh+c3*Eh^2 + */ + vmovups c4s+__svml_dasinh_data_internal_avx512(%rip), %zmm6 + vmovups c3s+__svml_dasinh_data_internal_avx512(%rip), %zmm8 + +/* Sh*Eh */ + vmulpd {rn-sae}, %zmm12, %zmm14, %zmm11 + vfmadd231pd {rn-sae}, %zmm12, %zmm6, %zmm8 + +/* Sh+x */ + vaddpd {rn-sae}, %zmm1, %zmm14, %zmm6 + kmovw %k0, %edx + vfmadd213pd {rn-sae}, %zmm5, %zmm12, %zmm8 + vfmadd213pd {rn-sae}, %zmm7, %zmm12, %zmm8 + +/* Xh */ + vsubpd {rn-sae}, %zmm14, %zmm6, %zmm12 + +/* Sl + Sh*Eh*poly_s */ + vfmadd213pd {rn-sae}, %zmm13, %zmm8, %zmm11 + +/* fixup for very large inputs */ + vmovups OneEighth+__svml_dasinh_data_internal_avx512(%rip), %zmm8 + +/* Xl */ + vsubpd {rn-sae}, %zmm12, %zmm1, %zmm12 + +/* Xin0+Sl+Sh*Eh*poly_s ~ x+sqrt(1+x^2) */ + vaddpd {rn-sae}, %zmm11, %zmm6, %zmm10 + +/* Sl_high */ + vsubpd {rn-sae}, %zmm6, %zmm10, %zmm5 + vmulpd {rn-sae}, %zmm8, %zmm1, %zmm10{%k1} + +/* Table lookups */ + vmovups __svml_dasinh_data_internal_avx512(%rip), %zmm6 + +/* Sl_l */ + vsubpd {rn-sae}, %zmm5, %zmm11, %zmm7 + vrcp14pd %zmm10, %zmm13 + +/* Xin_low */ + vaddpd {rn-sae}, %zmm12, %zmm7, %zmm14 + vmovups Log_tbl_L+__svml_dasinh_data_internal_avx512(%rip), %zmm7 + vmovups poly_coeff6+__svml_dasinh_data_internal_avx512(%rip), %zmm12 + +/* round reciprocal to 1+4b mantissas */ + vpaddq AddB5+__svml_dasinh_data_internal_avx512(%rip), %zmm13, %zmm11 + +/* fixup for very large inputs */ + vxorpd %zmm14, %zmm14, %zmm14{%k1} + vmovups poly_coeff5+__svml_dasinh_data_internal_avx512(%rip), %zmm13 + vandpd RcpBitMask+__svml_dasinh_data_internal_avx512(%rip), %zmm11, %zmm15 + vmovups poly_coeff7+__svml_dasinh_data_internal_avx512(%rip), %zmm11 + +/* Prepare table index */ + vpsrlq $48, %zmm15, %zmm5 + +/* reduced argument for log(): (Rcp*Xin-1)+Rcp*Xin_low */ + vfmsub231pd {rn-sae}, %zmm15, %zmm10, %zmm9 + +/* exponents */ + vgetexppd {sae}, 
%zmm15, %zmm8 + vmovups Four+__svml_dasinh_data_internal_avx512(%rip), %zmm10 + vpermt2pd Log_tbl_H+64+__svml_dasinh_data_internal_avx512(%rip), %zmm5, %zmm6 + vpermt2pd Log_tbl_L+64+__svml_dasinh_data_internal_avx512(%rip), %zmm5, %zmm7 + vsubpd {rn-sae}, %zmm10, %zmm8, %zmm8{%k1} + vfmadd231pd {rn-sae}, %zmm15, %zmm14, %zmm9 + +/* polynomials */ + vmovups poly_coeff9+__svml_dasinh_data_internal_avx512(%rip), %zmm10 + vmovups poly_coeff8+__svml_dasinh_data_internal_avx512(%rip), %zmm5 + vmovups poly_coeff4+__svml_dasinh_data_internal_avx512(%rip), %zmm14 + +/* -K*L2H + Th */ + vmovups L2H+__svml_dasinh_data_internal_avx512(%rip), %zmm15 + vfmadd231pd {rn-sae}, %zmm9, %zmm10, %zmm5 + +/* -K*L2L + Tl */ + vmovups L2L+__svml_dasinh_data_internal_avx512(%rip), %zmm10 + vfnmadd231pd {rn-sae}, %zmm8, %zmm15, %zmm6 + vfmadd213pd {rn-sae}, %zmm11, %zmm9, %zmm5 + vfnmadd213pd {rn-sae}, %zmm7, %zmm10, %zmm8 + vmovups poly_coeff3+__svml_dasinh_data_internal_avx512(%rip), %zmm7 + vmovups poly_coeff1+__svml_dasinh_data_internal_avx512(%rip), %zmm10 + +/* R^2 */ + vmulpd {rn-sae}, %zmm9, %zmm9, %zmm11 + vfmadd213pd {rn-sae}, %zmm12, %zmm9, %zmm5 + vfmadd213pd {rn-sae}, %zmm13, %zmm9, %zmm5 + vfmadd213pd {rn-sae}, %zmm14, %zmm9, %zmm5 + vfmadd213pd {rn-sae}, %zmm7, %zmm9, %zmm5 + vmovups poly_coeff2+__svml_dasinh_data_internal_avx512(%rip), %zmm7 + vfmadd213pd {rn-sae}, %zmm7, %zmm9, %zmm5 + vfmadd213pd {rn-sae}, %zmm10, %zmm9, %zmm5 + +/* Tl + R^2*Poly */ + vfmadd213pd {rn-sae}, %zmm8, %zmm11, %zmm5 + +/* R+Tl + R^2*Poly */ + vaddpd {rn-sae}, %zmm9, %zmm5, %zmm9 + vaddpd {rn-sae}, %zmm9, %zmm6, %zmm4{%k2} + vxorpd %zmm2, %zmm4, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm3 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * 
special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm3, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 
0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movsd 64(%rsp,%r14,8), %xmm0 + call asinh@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movsd %xmm0, 128(%rsp,%r14,8) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN8v_asinh_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_dasinh_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl_H[16][2]; + __declspec(align(64)) VUINT32 Log_tbl_L[16][2]; + __declspec(align(64)) VUINT32 One[8][2]; + __declspec(align(64)) VUINT32 AbsMask[8][2]; + __declspec(align(64)) VUINT32 SmallThreshold[8][2]; + __declspec(align(64)) VUINT32 Threshold[8][2]; + __declspec(align(64)) VUINT32 LargeThreshold[8][2]; + __declspec(align(64)) VUINT32 ca2[8][2]; + __declspec(align(64)) VUINT32 ca1[8][2]; + __declspec(align(64)) VUINT32 c4s[8][2]; + __declspec(align(64)) VUINT32 c3s[8][2]; + __declspec(align(64)) VUINT32 c2s[8][2]; + __declspec(align(64)) VUINT32 c1s[8][2]; + __declspec(align(64)) VUINT32 AddB5[8][2]; + __declspec(align(64)) VUINT32 RcpBitMask[8][2]; + __declspec(align(64)) VUINT32 OneEighth[8][2]; + __declspec(align(64)) VUINT32 Four[8][2]; + __declspec(align(64)) VUINT32 poly_coeff9[8][2]; + __declspec(align(64)) VUINT32 poly_coeff8[8][2]; + __declspec(align(64)) VUINT32 poly_coeff7[8][2]; + __declspec(align(64)) VUINT32 poly_coeff6[8][2]; + __declspec(align(64)) VUINT32 poly_coeff5[8][2]; + __declspec(align(64)) VUINT32 poly_coeff4[8][2]; + __declspec(align(64)) VUINT32 poly_coeff3[8][2]; + 
__declspec(align(64)) VUINT32 poly_coeff2[8][2]; + __declspec(align(64)) VUINT32 poly_coeff1[8][2]; + __declspec(align(64)) VUINT32 L2H[8][2]; + __declspec(align(64)) VUINT32 L2L[8][2]; + } __svml_dasinh_data_internal_avx512; +#endif +__svml_dasinh_data_internal_avx512: + /*== Log_tbl_H ==*/ + .quad 0x0000000000000000 + .quad 0xbfaf0a30c0120000 + .quad 0xbfbe27076e2b0000 + .quad 0xbfc5ff3070a78000 + .quad 0xbfcc8ff7c79a8000 + .quad 0xbfd1675cababc000 + .quad 0xbfd4618bc21c4000 + .quad 0xbfd739d7f6bbc000 + .quad 0xbfd9f323ecbf8000 + .quad 0xbfdc8ff7c79a8000 + .quad 0xbfdf128f5faf0000 + .quad 0xbfe0be72e4252000 + .quad 0xbfe1e85f5e704000 + .quad 0xbfe307d7334f2000 + .quad 0xbfe41d8fe8468000 + .quad 0xbfe52a2d265bc000 + /*== Log_tbl_L ==*/ + .align 64 + .quad 0x0000000000000000 + .quad 0x3d53ab33d066d1d2 + .quad 0x3d2a342c2af0003c + .quad 0xbd43d3c873e20a07 + .quad 0xbd4a21ac25d81ef3 + .quad 0x3d59f1fc63382a8f + .quad 0xbd5ec27d0b7b37b3 + .quad 0xbd50069ce24c53fb + .quad 0xbd584bf2b68d766f + .quad 0xbd5a21ac25d81ef3 + .quad 0xbd3bb2cd720ec44c + .quad 0xbd55056d312f7668 + .quad 0xbd1a07bd8b34be7c + .quad 0x3d5e83c094debc15 + .quad 0x3d5aa33736867a17 + .quad 0xbd46abb9df22bc57 + /*== One ==*/ + .align 64 + .quad 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000, 0x3ff0000000000000 + /*== AbsMask ==*/ + .align 64 + .quad 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff + /*== SmallThreshold ==*/ + .align 64 + .quad 0x3f70000000000000, 0x3f70000000000000, 0x3f70000000000000, 0x3f70000000000000, 0x3f70000000000000, 0x3f70000000000000, 0x3f70000000000000, 0x3f70000000000000 + /*== Threshold ==*/ + .align 64 + .quad 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000, 0x5fe0000000000000 + /*== 
LargeThreshold ==*/ + .align 64 + .quad 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff, 0x7fefffffffffffff + /*== ca2 ==*/ + .align 64 + .quad 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7, 0x3fb333220eaf02e7 + /*== ca1 ==*/ + .align 64 + .quad 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e, 0xbfc5555555521e7e + /*== c4s ==*/ + .align 64 + .quad 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612, 0x3fd1800001943612 + /*== c3s ==*/ + .align 64 + .quad 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000, 0x3fd40000013b0000 + /*== c2s ==*/ + .align 64 + .quad 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000, 0x3fd8000000000000 + /*== c1s ==*/ + .align 64 + .quad 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000, 0x3fe0000000000000 + /*== AddB5 ==*/ + .align 64 + .quad 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000, 0x0000800000000000 + /*== RcpBitMask ==*/ + .align 64 + .quad 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000, 0xffff000000000000 + /*==OneEighth ==*/ + .align 64 + .quad 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000, 0x3fc0000000000000 + /*== Four ==*/ + .align 64 + 
.quad 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000, 0x4010000000000000 + /*== poly_coeff9 ==*/ + .align 64 + .quad 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368, 0xbfb9a9b040214368 + /*== poly_coeff8 ==*/ + .align 64 + .quad 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778, 0x3fbc80666e249778 + /*== poly_coeff7 ==*/ + .align 64 + .quad 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9, 0xbfbffffb8a054bc9 + /*== poly_coeff6 ==*/ + .align 64 + .quad 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1, 0x3fc24922f71256f1 + /*== poly_coeff5 ==*/ + .align 64 + .quad 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736, 0xbfc55555559ba736 + /*== poly_coeff4 ==*/ + .align 64 + .quad 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af, 0x3fc9999999be77af + /*== poly_coeff3 ==*/ + .align 64 + .quad 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65, 0xbfcffffffffffc65 + /*== poly_coeff2 ==*/ + .align 64 + .quad 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1, 0x3fd55555555554c1 + /*== poly_coeff1 ==*/ + .align 64 + .quad 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000, 0xbfe0000000000000 + /*== 
L2H = log(2)_high ==*/ + .align 64 + .quad 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000, 0x3fe62E42FEFA0000 + /*== L2L = log(2)_low ==*/ + .align 64 + .quad 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000, 0x3d7cf79abc9e0000 + .align 64 + .type __svml_dasinh_data_internal_avx512,@object + .size __svml_dasinh_data_internal_avx512,.-__svml_dasinh_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core-avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core-avx2.S new file mode 100644 index 0000000000..7dfd95e400 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core-avx2.S @@ -0,0 +1,20 @@ +/* AVX2 version of vectorized asinhf. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>.
*/ + +#define _ZGVeN16v_asinhf _ZGVeN16v_asinhf_avx2_wrapper +#include "../svml_s_asinhf16_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core.c new file mode 100644 index 0000000000..dc770a0e65 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized asinhf, vector length is 16. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#define SYMBOL_NAME _ZGVeN16v_asinhf +#include "ifunc-mathvec-avx512-skx.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVeN16v_asinhf, __GI__ZGVeN16v_asinhf, + __redirect__ZGVeN16v_asinhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core_avx512.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core_avx512.S new file mode 100644 index 0000000000..fc6a8e7cd3 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf16_core_avx512.S @@ -0,0 +1,476 @@ +/* Function asinhf vectorized with AVX-512. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute asinh(x) as log(x + sqrt(x*x + 1)) + * using RSQRT instructions for starting the + * square root approximation, and small table lookups for log + * that map to AVX-512 permute instructions + * + * Special cases: + * + * asinh(NaN) = quiet NaN, and raise invalid exception + * asinh(INF) = that INF + * asinh(0) = that 0 + * + */ + +/* Offsets for data table __svml_sasinh_data_internal_avx512 + */ +#define Log_tbl_H 0 +#define Log_tbl_L 128 +#define One 256 +#define AbsMask 320 +#define SmallThreshold 384 +#define Threshold 448 +#define LargeThreshold 512 +#define ca1 576 +#define c2s 640 +#define c1s 704 +#define AddB5 768 +#define RcpBitMask 832 +#define OneEighth 896 +#define Four 960 +#define poly_coeff3 1024 +#define poly_coeff2 1088 +#define poly_coeff1 1152 +#define L2H 1216 +#define L2L 1280 + +#include <sysdep.h> + + .text + .section .text.evex512,"ax",@progbits +ENTRY(_ZGVeN16v_asinhf_skx) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-64, %rsp + subq $192, %rsp + vmovaps %zmm0, %zmm10 + +/* x^2 */ + vmulps {rn-sae}, %zmm10, %zmm10, %zmm0 + vmovups One+__svml_sasinh_data_internal_avx512(%rip), %zmm2 + +/* polynomial computation for small inputs */ + vmovups 
ca1+__svml_sasinh_data_internal_avx512(%rip), %zmm1 + +/* not a very small input ? */ + vmovups SmallThreshold+__svml_sasinh_data_internal_avx512(%rip), %zmm11 + +/* 1+x^2 */ + vaddps {rn-sae}, %zmm2, %zmm0, %zmm7 + +/* |input| */ + vandps AbsMask+__svml_sasinh_data_internal_avx512(%rip), %zmm10, %zmm12 + +/* A=max(x^2, 1); */ + vmaxps {sae}, %zmm0, %zmm2, %zmm14 + vrsqrt14ps %zmm7, %zmm8 + +/* B=min(x^2, 1); */ + vminps {sae}, %zmm0, %zmm2, %zmm15 + vcmpps $21, {sae}, %zmm11, %zmm12, %k2 + +/* B_high */ + vsubps {rn-sae}, %zmm14, %zmm7, %zmm9 + +/* sign bit */ + vxorps %zmm10, %zmm12, %zmm13 + +/* Sh ~sqrt(1+x^2) */ + vmulps {rn-sae}, %zmm8, %zmm7, %zmm6 + vmovups LargeThreshold+__svml_sasinh_data_internal_avx512(%rip), %zmm14 + +/* B_low */ + vsubps {rn-sae}, %zmm9, %zmm15, %zmm3 + +/* Sh+x */ + vaddps {rn-sae}, %zmm12, %zmm6, %zmm15 + +/* (Yh*R0)_low */ + vfmsub213ps {rn-sae}, %zmm6, %zmm8, %zmm7 + vmulps {rn-sae}, %zmm1, %zmm0, %zmm9 + vcmpps $22, {sae}, %zmm14, %zmm12, %k0 + vmovups c1s+__svml_sasinh_data_internal_avx512(%rip), %zmm1 + +/* polynomial computation for small inputs */ + vfmadd213ps {rn-sae}, %zmm12, %zmm12, %zmm9 + kmovw %k0, %edx + +/* (x^2)_low */ + vmovaps %zmm10, %zmm4 + vfmsub213ps {rn-sae}, %zmm0, %zmm10, %zmm4 + +/* Yl = (x^2)_low + B_low */ + vaddps {rn-sae}, %zmm4, %zmm3, %zmm5 + +/* rel. error term: Eh=1-Sh*R0 */ + vmovaps %zmm2, %zmm0 + vfnmadd231ps {rn-sae}, %zmm6, %zmm8, %zmm0 + +/* Sl = (Yh*R0)_low+(R0*Yl) */ + vfmadd213ps {rn-sae}, %zmm7, %zmm8, %zmm5 + +/* very large inputs ? */ + vmovups Threshold+__svml_sasinh_data_internal_avx512(%rip), %zmm7 + +/* rel. 
error term: Eh=(1-Sh*R0)-Sl*R0 */ + vfnmadd231ps {rn-sae}, %zmm5, %zmm8, %zmm0 + +/* sqrt(1+x^2) ~ Sh + Sl + Sh*Eh*poly_s */ + vmovups c2s+__svml_sasinh_data_internal_avx512(%rip), %zmm8 + vcmpps $21, {sae}, %zmm7, %zmm12, %k1 + +/* Sh*Eh */ + vmulps {rn-sae}, %zmm0, %zmm6, %zmm4 + vfmadd231ps {rn-sae}, %zmm0, %zmm8, %zmm1 + +/* Sl + Sh*Eh*poly_s */ + vfmadd213ps {rn-sae}, %zmm5, %zmm1, %zmm4 + +/* Xh */ + vsubps {rn-sae}, %zmm6, %zmm15, %zmm5 + +/* fixup for very large inputs */ + vmovups OneEighth+__svml_sasinh_data_internal_avx512(%rip), %zmm6 + +/* Xin0+Sl+Sh*Eh*poly_s ~ x+sqrt(1+x^2) */ + vaddps {rn-sae}, %zmm4, %zmm15, %zmm3 + +/* Xl */ + vsubps {rn-sae}, %zmm5, %zmm12, %zmm5 + +/* Sl_high */ + vsubps {rn-sae}, %zmm15, %zmm3, %zmm0 + vmulps {rn-sae}, %zmm6, %zmm12, %zmm3{%k1} + +/* -K*L2H + Th */ + vmovups L2H+__svml_sasinh_data_internal_avx512(%rip), %zmm15 + +/* Sl_l */ + vsubps {rn-sae}, %zmm0, %zmm4, %zmm1 + vrcp14ps %zmm3, %zmm6 + +/* Table lookups */ + vmovups __svml_sasinh_data_internal_avx512(%rip), %zmm0 + +/* Xin_low */ + vaddps {rn-sae}, %zmm5, %zmm1, %zmm7 + +/* round reciprocal to 1+4b mantissas */ + vpaddd AddB5+__svml_sasinh_data_internal_avx512(%rip), %zmm6, %zmm4 + vmovups poly_coeff1+__svml_sasinh_data_internal_avx512(%rip), %zmm5 + vandps RcpBitMask+__svml_sasinh_data_internal_avx512(%rip), %zmm4, %zmm8 + +/* fixup for very large inputs */ + vxorps %zmm7, %zmm7, %zmm7{%k1} + +/* polynomial */ + vmovups poly_coeff3+__svml_sasinh_data_internal_avx512(%rip), %zmm4 + +/* reduced argument for log(): (Rcp*Xin-1)+Rcp*Xin_low */ + vfmsub231ps {rn-sae}, %zmm8, %zmm3, %zmm2 + vmovups Four+__svml_sasinh_data_internal_avx512(%rip), %zmm3 + +/* exponents */ + vgetexpps {sae}, %zmm8, %zmm1 + +/* Prepare table index */ + vpsrld $18, %zmm8, %zmm14 + vfmadd231ps {rn-sae}, %zmm8, %zmm7, %zmm2 + vmovups poly_coeff2+__svml_sasinh_data_internal_avx512(%rip), %zmm7 + vsubps {rn-sae}, %zmm3, %zmm1, %zmm1{%k1} + vpermt2ps 
Log_tbl_H+64+__svml_sasinh_data_internal_avx512(%rip), %zmm14, %zmm0 + vmovups Log_tbl_L+__svml_sasinh_data_internal_avx512(%rip), %zmm3 + vfmadd231ps {rn-sae}, %zmm2, %zmm4, %zmm7 + vfnmadd231ps {rn-sae}, %zmm1, %zmm15, %zmm0 + +/* R^2 */ + vmulps {rn-sae}, %zmm2, %zmm2, %zmm6 + vfmadd213ps {rn-sae}, %zmm5, %zmm2, %zmm7 + vpermt2ps Log_tbl_L+64+__svml_sasinh_data_internal_avx512(%rip), %zmm14, %zmm3 + +/* -K*L2L + Tl */ + vmovups L2L+__svml_sasinh_data_internal_avx512(%rip), %zmm14 + vfnmadd213ps {rn-sae}, %zmm3, %zmm14, %zmm1 + +/* Tl + R^2*Poly */ + vfmadd213ps {rn-sae}, %zmm1, %zmm6, %zmm7 + +/* R+Tl + R^2*Poly */ + vaddps {rn-sae}, %zmm2, %zmm7, %zmm2 + vaddps {rn-sae}, %zmm2, %zmm0, %zmm9{%k2} + vxorps %zmm13, %zmm9, %zmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx zmm0 zmm10 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %zmm10, 64(%rsp) + vmovups %zmm0, 128(%rsp) + # LOE rbx r12 r13 r14 r15 edx zmm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; 
DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $16, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 128(%rsp), %zmm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -176; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x50, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -184; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x48, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -64; DW_OP_and; DW_OP_const4s: -192; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xc0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0x40, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 zmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 64(%rsp,%r14,4), %xmm0 + call asinhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 128(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVeN16v_asinhf_skx) + + .section .rodata, "a" + .align 64 + +#ifdef __svml_sasinh_data_internal_avx512_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(64)) VUINT32 Log_tbl_H[32][1]; + 
__declspec(align(64)) VUINT32 Log_tbl_L[32][1]; + __declspec(align(64)) VUINT32 One[16][1]; + __declspec(align(64)) VUINT32 AbsMask[16][1]; + __declspec(align(64)) VUINT32 SmallThreshold[16][1]; + __declspec(align(64)) VUINT32 Threshold[16][1]; + __declspec(align(64)) VUINT32 LargeThreshold[16][1]; + __declspec(align(64)) VUINT32 ca1[16][1]; + __declspec(align(64)) VUINT32 c2s[16][1]; + __declspec(align(64)) VUINT32 c1s[16][1]; + __declspec(align(64)) VUINT32 AddB5[16][1]; + __declspec(align(64)) VUINT32 RcpBitMask[16][1]; + __declspec(align(64)) VUINT32 OneEighth[16][1]; + __declspec(align(64)) VUINT32 Four[16][1]; + __declspec(align(64)) VUINT32 poly_coeff3[16][1]; + __declspec(align(64)) VUINT32 poly_coeff2[16][1]; + __declspec(align(64)) VUINT32 poly_coeff1[16][1]; + __declspec(align(64)) VUINT32 L2H[16][1]; + __declspec(align(64)) VUINT32 L2L[16][1]; + } __svml_sasinh_data_internal_avx512; +#endif +__svml_sasinh_data_internal_avx512: + /*== Log_tbl_H ==*/ + .long 0x00000000 + .long 0xbcfc0000 + .long 0xbd788000 + .long 0xbdb78000 + .long 0xbdf14000 + .long 0xbe14a000 + .long 0xbe300000 + .long 0xbe4aa000 + .long 0xbe648000 + .long 0xbe7dc000 + .long 0xbe8b4000 + .long 0xbe974000 + .long 0xbea31000 + .long 0xbeae9000 + .long 0xbeb9d000 + .long 0xbec4d000 + .long 0xbecfa000 + .long 0xbeda2000 + .long 0xbee48000 + .long 0xbeeea000 + .long 0xbef89000 + .long 0xbf012800 + .long 0xbf05f000 + .long 0xbf0aa800 + .long 0xbf0f4000 + .long 0xbf13c800 + .long 0xbf184000 + .long 0xbf1ca000 + .long 0xbf20f000 + .long 0xbf252800 + .long 0xbf295000 + .long 0xbf2d6800 + /*== Log_tbl_L ==*/ + .align 64 + .long 0x80000000 + .long 0xb726c39e + .long 0x3839e7fe + .long 0xb7528ae5 + .long 0x377891d5 + .long 0xb8297c10 + .long 0x37cf8f58 + .long 0x3852b186 + .long 0x35838656 + .long 0xb80c36af + .long 0x38235454 + .long 0xb862bae1 + .long 0x37e87bc7 + .long 0x37848150 + .long 0x37202511 + .long 0xb74e1b05 + .long 0x385c1340 + .long 0xb8777bcd + .long 0x36038656 + .long 0xb7d40984 + 
.long 0xb80f5faf + .long 0xb8254b4c + .long 0xb865c84a + .long 0x37f0b42d + .long 0xb83ebce1 + .long 0xb83c2513 + .long 0x37a332c4 + .long 0x3779654f + .long 0x38602f73 + .long 0x367449f8 + .long 0xb7b4996f + .long 0xb800986b + /*== One ==*/ + .align 64 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== AbsMask ==*/ + .align 64 + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== SmallThreshold ==*/ + .align 64 + .long 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000, 0x3c800000 + /*== Threshold ==*/ + .align 64 + .long 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000, 0x5f000000 + /*== LargeThreshold ==*/ + .align 64 + .long 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff, 0x7f7fffff + /*== ca1 ==*/ + .align 64 + .long 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE, 0xbe2AA5DE + /*== c2s ==*/ + .align 64 + .long 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000, 0x3ec00000 + /*== c1s ==*/ + .align 64 + .long 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 
0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000, 0x3f000000 + /*== AddB5 ==*/ + .align 64 + .long 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000, 0x00020000 + /*== RcpBitMask ==*/ + .align 64 + .long 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000, 0xfffc0000 + /*==OneEighth ==*/ + .align 64 + .long 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000, 0x3e000000 + /*== Four ==*/ + .align 64 + .long 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000, 0x40800000 + /*== poly_coeff3 ==*/ + .align 64 + .long 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810, 0xbe800810 + /*== poly_coeff2 ==*/ + .align 64 + .long 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e, 0x3eaab11e + /*== poly_coeff1 ==*/ + .align 64 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 + /*== L2H = log(2)_high ==*/ + .align 64 + .long 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000, 0x3f317000 + /*== L2L = log(2)_low ==*/ + .align 64 + 
.long 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4, 0x3805fdf4 + .align 64 + .type __svml_sasinh_data_internal_avx512,@object + .size __svml_sasinh_data_internal_avx512,.-__svml_sasinh_data_internal_avx512 diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core-sse2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core-sse2.S new file mode 100644 index 0000000000..52e4d2f728 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core-sse2.S @@ -0,0 +1,20 @@ +/* SSE2 version of vectorized asinhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVbN4v_asinhf _ZGVbN4v_asinhf_sse2 +#include "../svml_s_asinhf4_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core.c new file mode 100644 index 0000000000..296d5754ae --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized asinhf, vector length is 4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define SYMBOL_NAME _ZGVbN4v_asinhf +#include "ifunc-mathvec-sse4_1.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVbN4v_asinhf, __GI__ZGVbN4v_asinhf, + __redirect__ZGVbN4v_asinhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core_sse4.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core_sse4.S new file mode 100644 index 0000000000..1eeeb4f5af --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf4_core_sse4.S @@ -0,0 +1,509 @@ +/* Function asinhf vectorized with SSE4. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. */ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute asinh(x) as log(x + sqrt(x*x + 1)) + * + * Special cases: + * + * asinh(NaN) = quiet NaN, and raise invalid exception + * asinh(INF) = that INF + * asinh(0) = that 0 + * + */ + +/* Offsets for data table __svml_sasinh_data_internal + */ +#define SgnMask 0 +#define sOne 16 +#define sPoly 32 +#define iBrkValue 160 +#define iOffExpoMask 176 +#define sBigThreshold 192 +#define sC2 208 +#define sC3 224 +#define sHalf 240 +#define sLargestFinite 256 +#define sLittleThreshold 272 +#define sSign 288 +#define sThirtyOne 304 +#define sTopMask11 320 +#define sTopMask8 336 +#define XScale 352 +#define sLn2 368 + +#include + + .text + .section .text.sse4,"ax",@progbits +ENTRY(_ZGVbN4v_asinhf_sse4) + subq $72, %rsp + cfi_def_cfa_offset(80) + movaps %xmm0, %xmm8 + +/* + * Split X into high and low parts, XHi (<= 11 bits) and XLo (<= 13 bits) + * We could use either X or |X| here, but it doesn't seem to matter + */ + movups sTopMask11+__svml_sasinh_data_internal(%rip), %xmm10 + movaps %xmm8, %xmm2 + andps %xmm8, %xmm10 + +/* + * Compute X^2 = (XHi + XLo)^2 = XHi^2 + XLo * (X + XHi) + * The two parts are shifted off by around 11 bits. So even though + * the low bit will not in general be exact, it's near enough + */ + movaps %xmm10, %xmm3 + subps %xmm10, %xmm2 + mulps %xmm10, %xmm3 + addps %xmm8, %xmm10 + +/* Load the constant 1 and a sign mask */ + movups sOne+__svml_sasinh_data_internal(%rip), %xmm7 + +/* + * Finally, express Y + W = X^2 + 1 accurately where Y has <= 8 bits. + * If |X| <= 1 then |XHi| <= 1 and so |X2Hi| <= 1, so we can treat 1 + * as the dominant component in the compensated summation. Otherwise, + * if |X| >= 1, then since X2Hi only has 22 significant bits, the basic + * addition will be exact anyway until we get to |X| >= 2^24. 
But by + * that time the log function is well-conditioned enough that the + * rounding error doesn't matter. Hence we can treat 1 as dominant even + * if it literally isn't. + */ + movaps %xmm7, %xmm11 + movaps %xmm7, %xmm4 + movups sTopMask8+__svml_sasinh_data_internal(%rip), %xmm12 + addps %xmm3, %xmm11 + mulps %xmm10, %xmm2 + subps %xmm11, %xmm4 + movaps %xmm12, %xmm0 + addps %xmm3, %xmm4 + +/* + * Unfortunately, we can still be in trouble if |X| <= 2^-5, since + * the absolute error 2^-(7+24)-ish in sqrt(1 + X^2) gets scaled up + * by 1/X and comes close to our threshold. Hence if |X| <= 2^-4, + * perform an alternative computation + * sqrt(1 + X^2) - 1 = X^2/2 - X^4/8 + X^6/16 + * X2 = X^2 + */ + addps %xmm2, %xmm3 + addps %xmm2, %xmm4 + andps %xmm11, %xmm0 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 8 significant bits. + * This means that R * Y and R^2 * Y are exactly representable. + */ + rsqrtps %xmm0, %xmm14 + subps %xmm0, %xmm11 + andps %xmm12, %xmm14 + addps %xmm11, %xmm4 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + mulps %xmm14, %xmm0 + mulps %xmm14, %xmm4 + +/* + * Get the absolute value of the input, since we will exploit antisymmetry + * and mostly assume X >= 0 in the core computation + */ + movups SgnMask+__svml_sasinh_data_internal(%rip), %xmm6 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-8 + */ + movaps %xmm14, %xmm13 + andps %xmm8, %xmm6 + +/* + * Obtain sqrt(1 + X^2) - 1 in two pieces + * sqrt(1 + X^2) - 1 + * = sqrt(Y + W) - 1 + * = (S + T) * (1 + Corr) - 1 + * = [S - 1] + [T + (S + T) * Corr] + * We need a compensated summation for the last part. 
We treat S - 1 + * as the larger part; it certainly is until about X < 2^-4, and in that + * case, the error is affordable since X dominates over sqrt(1 + X^2) - 1 + * Final sum is dTmp5 (hi) + dTmp7 (lo) + */ + movaps %xmm0, %xmm1 + +/* + * Check whether the input is finite, by checking |X| <= MaxFloat + * Otherwise set the rangemask so that the callout will get used. + * Note that this will also use the callout for NaNs since not(NaN <= MaxFloat) + */ + movaps %xmm6, %xmm9 + +/* + * The following computation can go wrong for very large X, basically + * because X^2 overflows. But for large X we have + * asinh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when do do this. + */ + movaps %xmm6, %xmm5 + cmpnleps sLargestFinite+__svml_sasinh_data_internal(%rip), %xmm9 + cmpltps sBigThreshold+__svml_sasinh_data_internal(%rip), %xmm5 + mulps %xmm0, %xmm13 + addps %xmm4, %xmm1 + subps %xmm7, %xmm0 + mulps %xmm4, %xmm14 + movmskps %xmm9, %edx + movaps %xmm7, %xmm9 + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + ... + * So compute the first three nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. + * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + */ + movups sC3+__svml_sasinh_data_internal(%rip), %xmm15 + subps %xmm13, %xmm9 + movups sHalf+__svml_sasinh_data_internal(%rip), %xmm10 + subps %xmm14, %xmm9 + +/* sX2over2 = X^2/2 */ + mulps %xmm10, %xmm3 + mulps %xmm9, %xmm15 + +/* sX46 = -X^4/4 + X^6/8 */ + movaps %xmm3, %xmm2 + movaps %xmm3, %xmm12 + +/* + * Now do another compensated sum to add |X| + [sqrt(1 + X^2) - 1]. + * It's always safe to assume |X| is larger. 
+ * This is the final 2-part argument to the log1p function + */ + movaps %xmm6, %xmm14 + addps sC2+__svml_sasinh_data_internal(%rip), %xmm15 + mulps %xmm9, %xmm15 + addps %xmm10, %xmm15 + mulps %xmm15, %xmm9 + mulps %xmm1, %xmm9 + +/* Now multiplex to the case X = 2^-30 * input, Xl = sL = 0 in the "big" case. */ + movups XScale+__svml_sasinh_data_internal(%rip), %xmm15 + addps %xmm9, %xmm4 + movaps %xmm4, %xmm11 + addps %xmm0, %xmm11 + subps %xmm11, %xmm0 + addps %xmm0, %xmm4 + +/* sX4over4 = X^4/4 */ + movaps %xmm3, %xmm0 + mulps %xmm3, %xmm0 + mulps %xmm0, %xmm2 + subps %xmm0, %xmm2 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. + * compute 1+x as high, low parts + */ + movaps %xmm7, %xmm0 + +/* sX46over2 = -X^4/8 + x^6/16 */ + mulps %xmm2, %xmm10 + movaps %xmm7, %xmm2 + addps %xmm10, %xmm12 + subps %xmm12, %xmm3 + addps %xmm3, %xmm10 + +/* Now multiplex the two possible computations */ + movaps %xmm6, %xmm3 + cmpleps sLittleThreshold+__svml_sasinh_data_internal(%rip), %xmm3 + movaps %xmm3, %xmm13 + andps %xmm3, %xmm12 + andnps %xmm11, %xmm13 + movaps %xmm3, %xmm1 + orps %xmm12, %xmm13 + andnps %xmm4, %xmm1 + andps %xmm3, %xmm10 + movaps %xmm6, %xmm4 + orps %xmm10, %xmm1 + addps %xmm13, %xmm14 + mulps %xmm15, %xmm6 + maxps %xmm14, %xmm0 + minps %xmm14, %xmm2 + subps %xmm14, %xmm4 + movaps %xmm0, %xmm3 + addps %xmm4, %xmm13 + addps %xmm2, %xmm3 + addps %xmm13, %xmm1 + subps %xmm3, %xmm0 + movaps %xmm5, %xmm4 + andps %xmm5, %xmm3 + andnps %xmm6, %xmm4 + addps %xmm0, %xmm2 + +/* + * Now resume the main code. 
+ * reduction: compute r,n + */ + movdqu iBrkValue+__svml_sasinh_data_internal(%rip), %xmm6 + orps %xmm3, %xmm4 + psubd %xmm6, %xmm4 + movaps %xmm7, %xmm0 + addps %xmm2, %xmm1 + movdqu iOffExpoMask+__svml_sasinh_data_internal(%rip), %xmm2 + pand %xmm4, %xmm2 + psrad $23, %xmm4 + cvtdq2ps %xmm4, %xmm3 + pslld $23, %xmm4 + andps %xmm5, %xmm1 + paddd %xmm6, %xmm2 + psubd %xmm4, %xmm0 + mulps %xmm0, %xmm1 + +/* polynomial evaluation */ + subps %xmm7, %xmm2 + movups sPoly+112+__svml_sasinh_data_internal(%rip), %xmm7 + addps %xmm2, %xmm1 + mulps %xmm1, %xmm7 + movaps %xmm5, %xmm2 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + movups sThirtyOne+__svml_sasinh_data_internal(%rip), %xmm0 + addps sPoly+96+__svml_sasinh_data_internal(%rip), %xmm7 + addps %xmm3, %xmm0 + mulps %xmm1, %xmm7 + andnps %xmm0, %xmm2 + andps %xmm5, %xmm3 + orps %xmm3, %xmm2 + addps sPoly+80+__svml_sasinh_data_internal(%rip), %xmm7 + +/* final reconstruction */ + mulps sLn2+__svml_sasinh_data_internal(%rip), %xmm2 + mulps %xmm1, %xmm7 + +/* Finally, reincorporate the original sign. 
 */ + movups sSign+__svml_sasinh_data_internal(%rip), %xmm0 + andps %xmm8, %xmm0 + addps sPoly+64+__svml_sasinh_data_internal(%rip), %xmm7 + mulps %xmm1, %xmm7 + addps sPoly+48+__svml_sasinh_data_internal(%rip), %xmm7 + mulps %xmm1, %xmm7 + addps sPoly+32+__svml_sasinh_data_internal(%rip), %xmm7 + mulps %xmm1, %xmm7 + addps sPoly+16+__svml_sasinh_data_internal(%rip), %xmm7 + mulps %xmm1, %xmm7 + addps sPoly+__svml_sasinh_data_internal(%rip), %xmm7 + mulps %xmm1, %xmm7 + mulps %xmm1, %xmm7 + addps %xmm7, %xmm1 + addps %xmm2, %xmm1 + pxor %xmm1, %xmm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx rbp r12 r13 r14 r15 edx xmm0 xmm8 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + addq $72, %rsp + cfi_def_cfa_offset(8) + ret + cfi_def_cfa_offset(80) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + movups %xmm8, 32(%rsp) + movups %xmm0, 48(%rsp) + # LOE rbx rbp r12 r13 r14 r15 edx + + xorl %eax, %eax + movq %r12, 16(%rsp) + cfi_offset(12, -64) + movl %eax, %r12d + movq %r13, 8(%rsp) + cfi_offset(13, -72) + movl %edx, %r13d + movq %r14, (%rsp) + cfi_offset(14, -80) + # LOE rbx rbp r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx rbp r15 r12d r13d + +/* Special inputs + * processing loop + */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $4, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx rbp r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + movups 48(%rsp), %xmm0 + +/* Go to exit */ + jmp L(EXIT) + cfi_offset(12, -64) + cfi_offset(13, -72) + cfi_offset(14, -80) + # LOE rbx rbp r12 r13 r14 r15 xmm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 
+ call asinhf@PLT + # LOE rbx rbp r14 r15 r12d r13d xmm0 + + movss %xmm0, 48(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx rbp r15 r12d r13d +END(_ZGVbN4v_asinhf_sse4) + + .section .rodata, "a" + .align 16 + +#ifdef __svml_sasinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(16)) VUINT32 SgnMask[4][1]; + __declspec(align(16)) VUINT32 sOne[4][1]; + __declspec(align(16)) VUINT32 sPoly[8][4][1]; + __declspec(align(16)) VUINT32 iBrkValue[4][1]; + __declspec(align(16)) VUINT32 iOffExpoMask[4][1]; + __declspec(align(16)) VUINT32 sBigThreshold[4][1]; + __declspec(align(16)) VUINT32 sC2[4][1]; + __declspec(align(16)) VUINT32 sC3[4][1]; + __declspec(align(16)) VUINT32 sHalf[4][1]; + __declspec(align(16)) VUINT32 sLargestFinite[4][1]; + __declspec(align(16)) VUINT32 sLittleThreshold[4][1]; + __declspec(align(16)) VUINT32 sSign[4][1]; + __declspec(align(16)) VUINT32 sThirtyOne[4][1]; + __declspec(align(16)) VUINT32 sTopMask11[4][1]; + __declspec(align(16)) VUINT32 sTopMask8[4][1]; + __declspec(align(16)) VUINT32 XScale[4][1]; + __declspec(align(16)) VUINT32 sLn2[4][1]; +} __svml_sasinh_data_internal; +#endif +__svml_sasinh_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 16 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 16 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 
1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iBrkValue = SP 2/3 ==*/ + .align 16 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 16 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sBigThreshold ==*/ + .align 16 + .long 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000 + /*== sC2 ==*/ + .align 16 + .long 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000 + /*== sC3 ==*/ + .align 16 + .long 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000 + /*== sHalf ==*/ + .align 16 + .long 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000 + /*== sLargestFinite ==*/ + .align 16 + .long 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF + /*== sLittleThreshold ==*/ + .align 16 + .long 0x3D800000, 0x3D800000, 0x3D800000, 0x3D800000 + /*== sSign ==*/ + .align 16 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000 + /*== sThirtyOne ==*/ + .align 16 + .long 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000 + /*== sTopMask11 ==*/ + .align 16 + .long 0xFFFFE000, 0xFFFFE000, 0xFFFFE000, 0xFFFFE000 + /*== sTopMask8 ==*/ + .align 16 + .long 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000 + /*== XScale ==*/ + .align 16 + .long 0x30800000, 0x30800000, 0x30800000, 0x30800000 + /*== sLn2 = SP ln(2) ==*/ + .align 16 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 16 + .type __svml_sasinh_data_internal,@object + .size __svml_sasinh_data_internal,.-__svml_sasinh_data_internal diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core-sse.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core-sse.S new file mode 100644 index 0000000000..1a0e113e94 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core-sse.S @@ -0,0 +1,20 @@ +/* SSE version of vectorized asinhf, vector length is 8. 
+ Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define _ZGVdN8v_asinhf _ZGVdN8v_asinhf_sse_wrapper +#include "../svml_s_asinhf8_core.S" diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core.c b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core.c new file mode 100644 index 0000000000..d97097a394 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core.c @@ -0,0 +1,28 @@ +/* Multiple versions of vectorized asinhf, vector length is 8. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#define SYMBOL_NAME _ZGVdN8v_asinhf +#include "ifunc-mathvec-avx2.h" + +libc_ifunc_redirected (REDIRECT_NAME, SYMBOL_NAME, IFUNC_SELECTOR ()); + +#ifdef SHARED +__hidden_ver1 (_ZGVdN8v_asinhf, __GI__ZGVdN8v_asinhf, + __redirect__ZGVdN8v_asinhf) + __attribute__ ((visibility ("hidden"))); +#endif diff --git a/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core_avx2.S b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core_avx2.S new file mode 100644 index 0000000000..a966f53773 --- /dev/null +++ b/sysdeps/x86_64/fpu/multiarch/svml_s_asinhf8_core_avx2.S @@ -0,0 +1,457 @@ +/* Function asinhf vectorized with AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + https://www.gnu.org/licenses/. 
*/ + +/* + * ALGORITHM DESCRIPTION: + * + * Compute asinh(x) as log(x + sqrt(x*x + 1)) + * + * Special cases: + * + * asinh(NaN) = quiet NaN, and raise invalid exception + * asinh(INF) = that INF + * asinh(0) = that 0 + * + */ + +/* Offsets for data table __svml_sasinh_data_internal + */ +#define SgnMask 0 +#define sOne 32 +#define sPoly 64 +#define iBrkValue 320 +#define iOffExpoMask 352 +#define sBigThreshold 384 +#define sC2 416 +#define sC3 448 +#define sHalf 480 +#define sLargestFinite 512 +#define sLittleThreshold 544 +#define sSign 576 +#define sThirtyOne 608 +#define sTopMask8 640 +#define XScale 672 +#define sLn2 704 + +#include + + .text + .section .text.avx2,"ax",@progbits +ENTRY(_ZGVdN8v_asinhf_avx2) + pushq %rbp + cfi_def_cfa_offset(16) + movq %rsp, %rbp + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + andq $-32, %rsp + subq $96, %rsp + vmovaps %ymm0, %ymm9 + +/* Load the constant 1 and a sign mask */ + vmovups sOne+__svml_sasinh_data_internal(%rip), %ymm8 + +/* No need to split X when FMA is available in hardware. */ + vmulps %ymm9, %ymm9, %ymm5 + vmovups sTopMask8+__svml_sasinh_data_internal(%rip), %ymm1 + +/* + * Finally, express Y + W = X^2 + 1 accurately where Y has <= 8 bits. + * If |X| <= 1 then |XHi| <= 1 and so |X2Hi| <= 1, so we can treat 1 + * as the dominant component in the compensated summation. Otherwise, + * if |X| >= 1, then since X2Hi only has 22 significant bits, the basic + * addition will be exact anyway until we get to |X| >= 2^24. But by + * that time the log function is well-conditioned enough that the + * rounding error doesn't matter. Hence we can treat 1 as dominant even + * if it literally isn't. + */ + vaddps %ymm5, %ymm8, %ymm13 + vandps %ymm1, %ymm13, %ymm2 + vmovaps %ymm9, %ymm4 + vsubps %ymm13, %ymm8, %ymm11 + vsubps %ymm2, %ymm13, %ymm15 + +/* + * Compute R = 1/sqrt(Y + W) * (1 + d) + * Force R to <= 8 significant bits. + * This means that R * Y and R^2 * Y are exactly representable. 
 + */ + vrsqrtps %ymm2, %ymm0 + vfmsub213ps %ymm5, %ymm9, %ymm4 + vaddps %ymm11, %ymm5, %ymm12 + +/* + * Get the absolute value of the input, since we will exploit antisymmetry + * and mostly assume X >= 0 in the core computation + */ + vandps SgnMask+__svml_sasinh_data_internal(%rip), %ymm9, %ymm6 + +/* + * Check whether the input is finite, by checking |X| <= MaxFloat + * Otherwise set the rangemask so that the callout will get used. + * Note that this will also use the callout for NaNs since not(NaN <= MaxFloat) + */ + vcmpnle_uqps sLargestFinite+__svml_sasinh_data_internal(%rip), %ymm6, %ymm10 + vaddps %ymm12, %ymm4, %ymm14 + +/* + * Unfortunately, we can still be in trouble if |X| <= 2^-5, since + * the absolute error 2^-(7+24)-ish in sqrt(1 + X^2) gets scaled up + * by 1/X and comes close to our threshold. Hence if |X| <= 2^-4, + * perform an alternative computation + * sqrt(1 + X^2) - 1 = X^2/2 - X^4/8 + X^6/16 + * X2 = X^2 + */ + vaddps %ymm4, %ymm5, %ymm4 + +/* + * The following computation can go wrong for very large X, basically + * because X^2 overflows. But for large X we have + * asinh(X) / log(2 X) - 1 =~= 1/(4 * X^2), so for X >= 2^30 + * we can just later stick X back into the log and tweak up the exponent. + * Actually we scale X by 2^-30 and tweak the exponent up by 31, + * to stay in the safe range for the later log computation. + * Compute a flag now telling us when to do this. + */ + vcmplt_oqps sBigThreshold+__svml_sasinh_data_internal(%rip), %ymm6, %ymm7 + vaddps %ymm15, %ymm14, %ymm3 + +/* + * Now 1 / (1 + d) + * = 1 / (1 + (sqrt(1 - e) - 1)) + * = 1 / sqrt(1 - e) + * = 1 + 1/2 * e + 3/8 * e^2 + 5/16 * e^3 + 35/128 * e^4 + ... + * So compute the first three nonconstant terms of that, so that + * we have a relative correction (1 + Corr) to apply to S etc. 
+ * C1 = 1/2 + * C2 = 3/8 + * C3 = 5/16 + */ + vmovups sC3+__svml_sasinh_data_internal(%rip), %ymm12 + vmovmskps %ymm10, %edx + vandps %ymm1, %ymm0, %ymm10 + +/* + * Compute S = (Y/sqrt(Y + W)) * (1 + d) + * and T = (W/sqrt(Y + W)) * (1 + d) + * so that S + T = sqrt(Y + W) * (1 + d) + * S is exact, and the rounding error in T is OK. + */ + vmulps %ymm10, %ymm2, %ymm15 + vmulps %ymm3, %ymm10, %ymm14 + vmovups sHalf+__svml_sasinh_data_internal(%rip), %ymm3 + vsubps %ymm8, %ymm15, %ymm0 + +/* + * Obtain sqrt(1 + X^2) - 1 in two pieces + * sqrt(1 + X^2) - 1 + * = sqrt(Y + W) - 1 + * = (S + T) * (1 + Corr) - 1 + * = [S - 1] + [T + (S + T) * Corr] + * We need a compensated summation for the last part. We treat S - 1 + * as the larger part; it certainly is until about X < 2^-4, and in that + * case, the error is affordable since X dominates over sqrt(1 + X^2) - 1 + * Final sum is dTmp5 (hi) + dTmp7 (lo) + */ + vaddps %ymm14, %ymm15, %ymm13 + +/* + * Compute e = -(2 * d + d^2) + * The first FMR is exact, and the rounding error in the other is acceptable + * since d and e are ~ 2^-8 + */ + vmovaps %ymm8, %ymm11 + vfnmadd231ps %ymm15, %ymm10, %ymm11 + vfnmadd231ps %ymm14, %ymm10, %ymm11 + vfmadd213ps sC2+__svml_sasinh_data_internal(%rip), %ymm11, %ymm12 + vfmadd213ps %ymm3, %ymm11, %ymm12 + vmulps %ymm12, %ymm11, %ymm1 + +/* Now multiplex the two possible computations */ + vcmple_oqps sLittleThreshold+__svml_sasinh_data_internal(%rip), %ymm6, %ymm11 + vfmadd213ps %ymm14, %ymm13, %ymm1 + vaddps %ymm0, %ymm1, %ymm2 + vsubps %ymm2, %ymm0, %ymm10 + +/* sX2over2 = X^2/2 */ + vmulps %ymm4, %ymm3, %ymm0 + vaddps %ymm10, %ymm1, %ymm1 + +/* sX4over4 = X^4/4 */ + vmulps %ymm0, %ymm0, %ymm5 + +/* sX46 = -X^4/4 + X^6/8 */ + vfmsub231ps %ymm0, %ymm5, %ymm5 + +/* sX46over2 = -X^4/8 + x^6/16 */ + vmulps %ymm5, %ymm3, %ymm3 + vaddps %ymm3, %ymm0, %ymm5 + vblendvps %ymm11, %ymm5, %ymm2, %ymm2 + vsubps %ymm5, %ymm0, %ymm4 + +/* + * Now do another compensated sum to add |X| + [sqrt(1 + X^2) - 
1]. + * It's always safe to assume |X| is larger. + * This is the final 2-part argument to the log1p function + */ + vaddps %ymm2, %ymm6, %ymm14 + +/* + * Now resume the main code. + * reduction: compute r,n + */ + vmovups iBrkValue+__svml_sasinh_data_internal(%rip), %ymm5 + vaddps %ymm4, %ymm3, %ymm10 + +/* + * Now we feed into the log1p code, using H in place of _VARG1 and + * also adding L into Xl. + * compute 1+x as high, low parts + */ + vmaxps %ymm14, %ymm8, %ymm15 + vminps %ymm14, %ymm8, %ymm0 + vblendvps %ymm11, %ymm10, %ymm1, %ymm12 + vsubps %ymm14, %ymm6, %ymm1 + vaddps %ymm0, %ymm15, %ymm3 + +/* Now multiplex to the case X = 2^-30 * input, Xl = sL = 0 in the "big" case. */ + vmulps XScale+__svml_sasinh_data_internal(%rip), %ymm6, %ymm6 + vaddps %ymm1, %ymm2, %ymm13 + vsubps %ymm3, %ymm15, %ymm15 + vaddps %ymm13, %ymm12, %ymm1 + vaddps %ymm15, %ymm0, %ymm2 + vblendvps %ymm7, %ymm3, %ymm6, %ymm0 + vaddps %ymm2, %ymm1, %ymm4 + vpsubd %ymm5, %ymm0, %ymm1 + vpsrad $23, %ymm1, %ymm6 + vpand iOffExpoMask+__svml_sasinh_data_internal(%rip), %ymm1, %ymm2 + vmovups sPoly+224+__svml_sasinh_data_internal(%rip), %ymm1 + vpslld $23, %ymm6, %ymm10 + vpaddd %ymm5, %ymm2, %ymm13 + vcvtdq2ps %ymm6, %ymm0 + vpsubd %ymm10, %ymm8, %ymm12 + +/* polynomial evaluation */ + vsubps %ymm8, %ymm13, %ymm8 + +/* Add 31 to the exponent in the "large" case to get log(2 * input) */ + vaddps sThirtyOne+__svml_sasinh_data_internal(%rip), %ymm0, %ymm3 + vandps %ymm7, %ymm4, %ymm11 + vmulps %ymm12, %ymm11, %ymm14 + vblendvps %ymm7, %ymm0, %ymm3, %ymm0 + vaddps %ymm8, %ymm14, %ymm2 + vfmadd213ps sPoly+192+__svml_sasinh_data_internal(%rip), %ymm2, %ymm1 + vfmadd213ps sPoly+160+__svml_sasinh_data_internal(%rip), %ymm2, %ymm1 + vfmadd213ps sPoly+128+__svml_sasinh_data_internal(%rip), %ymm2, %ymm1 + vfmadd213ps sPoly+96+__svml_sasinh_data_internal(%rip), %ymm2, %ymm1 + vfmadd213ps sPoly+64+__svml_sasinh_data_internal(%rip), %ymm2, %ymm1 + vfmadd213ps sPoly+32+__svml_sasinh_data_internal(%rip), 
%ymm2, %ymm1 + vfmadd213ps sPoly+__svml_sasinh_data_internal(%rip), %ymm2, %ymm1 + vmulps %ymm1, %ymm2, %ymm4 + vfmadd213ps %ymm2, %ymm2, %ymm4 + +/* final reconstruction */ + vfmadd132ps sLn2+__svml_sasinh_data_internal(%rip), %ymm4, %ymm0 + +/* Finally, reincorporate the original sign. */ + vandps sSign+__svml_sasinh_data_internal(%rip), %ymm9, %ymm7 + vxorps %ymm0, %ymm7, %ymm0 + testl %edx, %edx + +/* Go to special inputs processing branch */ + jne L(SPECIAL_VALUES_BRANCH) + # LOE rbx r12 r13 r14 r15 edx ymm0 ymm9 + +/* Restore registers + * and exit the function + */ + +L(EXIT): + movq %rbp, %rsp + popq %rbp + cfi_def_cfa(7, 8) + cfi_restore(6) + ret + cfi_def_cfa(6, 16) + cfi_offset(6, -16) + +/* Branch to process + * special inputs + */ + +L(SPECIAL_VALUES_BRANCH): + vmovups %ymm9, 32(%rsp) + vmovups %ymm0, 64(%rsp) + # LOE rbx r12 r13 r14 r15 edx ymm0 + + xorl %eax, %eax + # LOE rbx r12 r13 r14 r15 eax edx + + vzeroupper + movq %r12, 16(%rsp) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + movl %eax, %r12d + movq %r13, 8(%rsp) + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + movl %edx, %r13d + movq %r14, (%rsp) + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r15 r12d r13d + +/* Range mask + * bits check + */ + +L(RANGEMASK_CHECK): + btl %r12d, %r13d + +/* Call scalar math function */ + jc L(SCALAR_MATH_CALL) + # LOE rbx r15 r12d r13d + +/* Special inputs + * processing loop 
+ */ + +L(SPECIAL_VALUES_LOOP): + incl %r12d + cmpl $8, %r12d + +/* Check bits in range mask */ + jl L(RANGEMASK_CHECK) + # LOE rbx r15 r12d r13d + + movq 16(%rsp), %r12 + cfi_restore(12) + movq 8(%rsp), %r13 + cfi_restore(13) + movq (%rsp), %r14 + cfi_restore(14) + vmovups 64(%rsp), %ymm0 + +/* Go to exit */ + jmp L(EXIT) + /* DW_CFA_expression: r12 (r12) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -80; DW_OP_plus) */ + .cfi_escape 0x10, 0x0c, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xb0, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r13 (r13) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -88; DW_OP_plus) */ + .cfi_escape 0x10, 0x0d, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa8, 0xff, 0xff, 0xff, 0x22 + /* DW_CFA_expression: r14 (r14) (DW_OP_lit8; DW_OP_minus; DW_OP_const4s: -32; DW_OP_and; DW_OP_const4s: -96; DW_OP_plus) */ + .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0xe0, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xa0, 0xff, 0xff, 0xff, 0x22 + # LOE rbx r12 r13 r14 r15 ymm0 + +/* Scalar math function call + * to process special input + */ + +L(SCALAR_MATH_CALL): + movl %r12d, %r14d + movss 32(%rsp,%r14,4), %xmm0 + call asinhf@PLT + # LOE rbx r14 r15 r12d r13d xmm0 + + movss %xmm0, 64(%rsp,%r14,4) + +/* Process special inputs in loop */ + jmp L(SPECIAL_VALUES_LOOP) + # LOE rbx r15 r12d r13d +END(_ZGVdN8v_asinhf_avx2) + + .section .rodata, "a" + .align 32 + +#ifdef __svml_sasinh_data_internal_typedef +typedef unsigned int VUINT32; +typedef struct { + __declspec(align(32)) VUINT32 SgnMask[8][1]; + __declspec(align(32)) VUINT32 sOne[8][1]; + __declspec(align(32)) VUINT32 sPoly[8][8][1]; + __declspec(align(32)) VUINT32 iBrkValue[8][1]; + __declspec(align(32)) VUINT32 iOffExpoMask[8][1]; + __declspec(align(32)) VUINT32 sBigThreshold[8][1]; + __declspec(align(32)) VUINT32 sC2[8][1]; + __declspec(align(32)) VUINT32 sC3[8][1]; + __declspec(align(32)) VUINT32 sHalf[8][1]; + 
__declspec(align(32)) VUINT32 sLargestFinite[8][1]; + __declspec(align(32)) VUINT32 sLittleThreshold[8][1]; + __declspec(align(32)) VUINT32 sSign[8][1]; + __declspec(align(32)) VUINT32 sThirtyOne[8][1]; + __declspec(align(32)) VUINT32 sTopMask8[8][1]; + __declspec(align(32)) VUINT32 XScale[8][1]; + __declspec(align(32)) VUINT32 sLn2[8][1]; +} __svml_sasinh_data_internal; +#endif +__svml_sasinh_data_internal: + /*== SgnMask ==*/ + .long 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff + /*== sOne = SP 1.0 ==*/ + .align 32 + .long 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000 + /*== sPoly[] = SP polynomial ==*/ + .align 32 + .long 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000, 0xbf000000 /* -5.0000000000000000000000000e-01 P0 */ + .long 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94, 0x3eaaaa94 /* 3.3333265781402587890625000e-01 P1 */ + .long 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e, 0xbe80058e /* -2.5004237890243530273437500e-01 P2 */ + .long 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190, 0x3e4ce190 /* 2.0007920265197753906250000e-01 P3 */ + .long 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37, 0xbe28ad37 /* -1.6472326219081878662109375e-01 P4 */ + .long 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12, 0x3e0fcb12 /* 1.4042308926582336425781250e-01 P5 */ + .long 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3, 0xbe1ad9e3 /* -1.5122179687023162841796875e-01 P6 */ + .long 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed, 0x3e0d84ed /* 1.3820238411426544189453125e-01 P7 */ + /*== iBrkValue = SP 2/3 ==*/ + .align 32 + .long 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 
0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab + /*== iOffExpoMask = SP significand mask ==*/ + .align 32 + .long 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff + /*== sBigThreshold ==*/ + .align 32 + .long 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000, 0x4E800000 + /*== sC2 ==*/ + .align 32 + .long 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000, 0x3EC00000 + /*== sC3 ==*/ + .align 32 + .long 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000, 0x3EA00000 + /*== sHalf ==*/ + .align 32 + .long 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000 + /*== sLargestFinite ==*/ + .align 32 + .long 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF, 0x7F7FFFFF + /*== sLittleThreshold ==*/ + .align 32 + .long 0x3D800000, 0x3D800000, 0x3D800000, 0x3D800000, 0x3D800000, 0x3D800000, 0x3D800000, 0x3D800000 + /*== sSign ==*/ + .align 32 + .long 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000, 0x80000000 + /*== sThirtyOne ==*/ + .align 32 + .long 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000, 0x41F80000 + /*== sTopMask8 ==*/ + .align 32 + .long 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000, 0xFFFF0000 + /*== XScale ==*/ + .align 32 + .long 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000, 0x30800000 + /*== sLn2 = SP ln(2) ==*/ + .align 32 + .long 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218, 0x3f317218 + .align 32 + .type __svml_sasinh_data_internal,@object + .size __svml_sasinh_data_internal,.-__svml_sasinh_data_internal diff --git a/sysdeps/x86_64/fpu/svml_d_asinh2_core.S b/sysdeps/x86_64/fpu/svml_d_asinh2_core.S new file mode 100644 index 0000000000..60e372238a --- /dev/null +++ 
b/sysdeps/x86_64/fpu/svml_d_asinh2_core.S @@ -0,0 +1,29 @@ +/* Function asinh vectorized with SSE2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVbN2v_asinh) +WRAPPER_IMPL_SSE2 asinh +END (_ZGVbN2v_asinh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN2v_asinh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_asinh4_core.S b/sysdeps/x86_64/fpu/svml_d_asinh4_core.S new file mode 100644 index 0000000000..c7350011e1 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asinh4_core.S @@ -0,0 +1,29 @@ +/* Function asinh vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVdN4v_asinh) +WRAPPER_IMPL_AVX _ZGVbN2v_asinh +END (_ZGVdN4v_asinh) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN4v_asinh) +#endif diff --git a/sysdeps/x86_64/fpu/svml_d_asinh4_core_avx.S b/sysdeps/x86_64/fpu/svml_d_asinh4_core_avx.S new file mode 100644 index 0000000000..83aaa8c3f1 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asinh4_core_avx.S @@ -0,0 +1,25 @@ +/* Function asinh vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVcN4v_asinh) +WRAPPER_IMPL_AVX _ZGVbN2v_asinh +END (_ZGVcN4v_asinh) diff --git a/sysdeps/x86_64/fpu/svml_d_asinh8_core.S b/sysdeps/x86_64/fpu/svml_d_asinh8_core.S new file mode 100644 index 0000000000..9597975ff6 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_d_asinh8_core.S @@ -0,0 +1,25 @@ +/* Function asinh vectorized with AVX-512, wrapper to AVX2. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_d_wrapper_impl.h" + + .text +ENTRY (_ZGVeN8v_asinh) +WRAPPER_IMPL_AVX512 _ZGVdN4v_asinh +END (_ZGVeN8v_asinh) diff --git a/sysdeps/x86_64/fpu/svml_s_asinhf16_core.S b/sysdeps/x86_64/fpu/svml_s_asinhf16_core.S new file mode 100644 index 0000000000..5b3d405f2e --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinhf16_core.S @@ -0,0 +1,25 @@ +/* Function asinhf vectorized with AVX-512. Wrapper to AVX2 version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVeN16v_asinhf) +WRAPPER_IMPL_AVX512 _ZGVdN8v_asinhf +END (_ZGVeN16v_asinhf) diff --git a/sysdeps/x86_64/fpu/svml_s_asinhf4_core.S b/sysdeps/x86_64/fpu/svml_s_asinhf4_core.S new file mode 100644 index 0000000000..af44fa5108 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinhf4_core.S @@ -0,0 +1,29 @@ +/* Function asinhf vectorized with SSE2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVbN4v_asinhf) +WRAPPER_IMPL_SSE2 asinhf +END (_ZGVbN4v_asinhf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVbN4v_asinhf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_asinhf8_core.S b/sysdeps/x86_64/fpu/svml_s_asinhf8_core.S new file mode 100644 index 0000000000..3bd06d8032 --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinhf8_core.S @@ -0,0 +1,29 @@ +/* Function asinhf vectorized with AVX2, wrapper version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVdN8v_asinhf) +WRAPPER_IMPL_AVX _ZGVbN4v_asinhf +END (_ZGVdN8v_asinhf) + +#ifndef USE_MULTIARCH + libmvec_hidden_def (_ZGVdN8v_asinhf) +#endif diff --git a/sysdeps/x86_64/fpu/svml_s_asinhf8_core_avx.S b/sysdeps/x86_64/fpu/svml_s_asinhf8_core_avx.S new file mode 100644 index 0000000000..f79616c0bd --- /dev/null +++ b/sysdeps/x86_64/fpu/svml_s_asinhf8_core_avx.S @@ -0,0 +1,25 @@ +/* Function asinhf vectorized in AVX ISA as wrapper to SSE4 ISA version. + Copyright (C) 2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#include "svml_s_wrapper_impl.h" + + .text +ENTRY (_ZGVcN8v_asinhf) +WRAPPER_IMPL_AVX _ZGVbN4v_asinhf +END (_ZGVcN8v_asinhf) diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx.c b/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx.c new file mode 100644 index 0000000000..da03528700 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx.c @@ -0,0 +1 @@ +#include "test-double-libmvec-asinh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx2.c b/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx2.c new file mode 100644 index 0000000000..da03528700 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx2.c @@ -0,0 +1 @@ +#include "test-double-libmvec-asinh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx512f.c b/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx512f.c new file mode 100644 index 0000000000..da03528700 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asinh-avx512f.c @@ -0,0 +1 @@ +#include "test-double-libmvec-asinh.c" diff --git a/sysdeps/x86_64/fpu/test-double-libmvec-asinh.c b/sysdeps/x86_64/fpu/test-double-libmvec-asinh.c new file mode 100644 index 0000000000..71e6b9f578 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-double-libmvec-asinh.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE double +#define LIBMVEC_FUNC asinh +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c index f53bb6813e..76114772ba 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen2-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVbN2v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVbN2v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVbN2v_erf) VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVbN2v_tanh) +VECTOR_WRAPPER (WRAPPER_NAME (asinh), _ZGVbN2v_asinh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c 
b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c index 0452c3db38..1e0ee34975 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-avx2-wrappers.c @@ -48,6 +48,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVdN4v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVdN4v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVdN4v_erf) VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVdN4v_tanh) +VECTOR_WRAPPER (WRAPPER_NAME (asinh), _ZGVdN4v_asinh) #ifndef __ILP32__ # define VEC_INT_TYPE __m256i diff --git a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c index 197d5afc88..17c43a75d1 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen4-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVcN4v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVcN4v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVcN4v_erf) VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVcN4v_tanh) +VECTOR_WRAPPER (WRAPPER_NAME (asinh), _ZGVcN4v_asinh) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c index e56ece640c..1c6809e6e3 100644 --- a/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-double-vlen8-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanh), _ZGVeN8v_atanh) VECTOR_WRAPPER (WRAPPER_NAME (acosh), _ZGVeN8v_acosh) VECTOR_WRAPPER (WRAPPER_NAME (erf), _ZGVeN8v_erf) VECTOR_WRAPPER (WRAPPER_NAME (tanh), _ZGVeN8v_tanh) +VECTOR_WRAPPER (WRAPPER_NAME (asinh), _ZGVeN8v_asinh) #ifndef __ILP32__ # define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx.c new file mode 100644 index 0000000000..77e1838bb4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx.c @@ -0,0 +1 @@ +#include "test-float-libmvec-asinhf.c" diff --git 
a/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx2.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx2.c new file mode 100644 index 0000000000..77e1838bb4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx2.c @@ -0,0 +1 @@ +#include "test-float-libmvec-asinhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx512f.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx512f.c new file mode 100644 index 0000000000..77e1838bb4 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf-avx512f.c @@ -0,0 +1 @@ +#include "test-float-libmvec-asinhf.c" diff --git a/sysdeps/x86_64/fpu/test-float-libmvec-asinhf.c b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf.c new file mode 100644 index 0000000000..3353754102 --- /dev/null +++ b/sysdeps/x86_64/fpu/test-float-libmvec-asinhf.c @@ -0,0 +1,3 @@ +#define LIBMVEC_TYPE float +#define LIBMVEC_FUNC asinhf +#include "test-vector-abi-arg1.h" diff --git a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c index abbebf9993..e8ab1885a7 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen16-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVeN16v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVeN16v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVeN16v_erff) VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVeN16v_tanhf) +VECTOR_WRAPPER (WRAPPER_NAME (asinhf), _ZGVeN16v_asinhf) #define VEC_INT_TYPE __m512i diff --git a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c index ae1c8b98c2..a80c5387e4 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen4-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVbN4v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVbN4v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVbN4v_erff) VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVbN4v_tanhf) 
+VECTOR_WRAPPER (WRAPPER_NAME (asinhf), _ZGVbN4v_asinhf) #define VEC_INT_TYPE __m128i diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c index eb477a0371..c3d1d5936b 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-avx2-wrappers.c @@ -48,6 +48,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVdN8v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVdN8v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVdN8v_erff) VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVdN8v_tanhf) +VECTOR_WRAPPER (WRAPPER_NAME (asinhf), _ZGVdN8v_asinhf) /* Redefinition of wrapper to be compatible with _ZGVdN8vvv_sincosf. */ #undef VECTOR_WRAPPER_fFF diff --git a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c index 944f7f0a75..b7da0f523b 100644 --- a/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c +++ b/sysdeps/x86_64/fpu/test-float-vlen8-wrappers.c @@ -45,6 +45,7 @@ VECTOR_WRAPPER (WRAPPER_NAME (atanhf), _ZGVcN8v_atanhf) VECTOR_WRAPPER (WRAPPER_NAME (acoshf), _ZGVcN8v_acoshf) VECTOR_WRAPPER (WRAPPER_NAME (erff), _ZGVcN8v_erff) VECTOR_WRAPPER (WRAPPER_NAME (tanhf), _ZGVcN8v_tanhf) +VECTOR_WRAPPER (WRAPPER_NAME (asinhf), _ZGVcN8v_asinhf) #define VEC_INT_TYPE __m128i