From patchwork Mon Apr 29 18:05:03 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Anthony Liguori
X-Patchwork-Id: 240431
From: Anthony Liguori
To: qemu-devel@nongnu.org
Date: Mon, 29 Apr 2013 13:05:03 -0500
Message-Id: <1367258703-6930-1-git-send-email-aliguori@us.ibm.com>
X-Mailer: git-send-email 1.8.0
Cc: Thiemo Seufer, Peter Maydell, Anthony Liguori, Richard Henderson,
 Stefan Weil, Juan Quintela, Max Filippov, Richard Sandiford, Jocelyn Mayer,
 Blue Swirl, Christophe Lyon, Paul Brook, malc, Paolo Bonzini, Guan Xuetao,
 Andreas Färber, Aurelien Jarno, Avi Kivity
Subject: [Qemu-devel] [PATCH] softfloat: rebase to version 2a

N.B. If you are on CC, see after the '---' for a requested action!

The license of SoftFloat-2b is claimed to be GPLv2 incompatible by the FSF due
to an indemnification clause. The previous release, SoftFloat-2a, did not
contain this clause. The only changes between these two versions, as far as
QEMU is concerned, are the license change and a global modification of the
comment structure.

This patch rebases our softfloat code to SoftFloat-2a in order to have a GPLv2
compatible license. Please note, this is a comment-only change. The resulting
binary should be the same.

I created this patch using the following strategy:

1) Create a branch using the original import of softfloat code:

   $ git checkout 158142c2c2df728cfa3b5320c65534921a764f26

2) Remove carriage returns from SoftFloat-2b

3) Compare each of the softfloat files against SoftFloat-2b using the
   following mapping to generate Fabrice's original softfloat changes:

   - fpu/softfloat.c -> softfloat/bits64/softfloat.c
   - fpu/softfloat.h -> softfloat/bits64/386-Win32-gcc/softfloat.h
   - fpu/softfloat-macros.h -> softfloat/bits64/softfloat-macros
   - fpu/softfloat-specialize.h -> softfloat/bits64/386-Win32-gcc/softfloat-specialize

4) Replace our softfloat files with the corresponding files from SoftFloat-2a

5) Apply the diffs from (3) to (4) and commit

6) Create a diff between (5) and 158142c2c2df728cfa3b5320c65534921a764f26

   - This diff consists 100% of licensing change + comment reformatting

7) Check out the latest master branch, apply the diff from (6)

   - There were a lot of comment rejects. I confirmed that the rejects were
     comments only, then used an emacs macro to rewrite the comments to the
     SoftFloat-2a form.

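As a rough illustration of steps 2) through 5), the same sequence can be tried
on throwaway files. This is only a sketch: the file names and contents below
are made up for the example and are not the real SoftFloat sources, and it
assumes nothing beyond a POSIX shell with tr, diff and patch.

```shell
#!/bin/sh
# Sketch only: hypothetical stand-in files, nothing here touches a QEMU tree.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for a SoftFloat-2b file as shipped (CRLF line endings).
printf '/* Release 2b */\r\nint a;\r\nint b;\r\nint c;\r\nint d;\r\nint f(void) { return 1; }\r\nint g;\r\nint h;\r\nint i;\r\n' > upstream-2b.c

# Stand-in for the imported file: LF endings plus one local code change.
printf '/* Release 2b */\nint a;\nint b;\nint c;\nint d;\nint f(void) { return 2; }\nint g;\nint h;\nint i;\n' > imported.c

# Stand-in for the SoftFloat-2a file: same code, different header comment.
printf '/* Release 2a */\nint a;\nint b;\nint c;\nint d;\nint f(void) { return 1; }\nint g;\nint h;\nint i;\n' > upstream-2a.c

# Step 2: remove carriage returns from the 2b file.
tr -d '\r' < upstream-2b.c > upstream-2b-lf.c

# Step 3: recover the local changes relative to 2b.
# diff exits with status 1 when the files differ, so tolerate that.
diff -u upstream-2b-lf.c imported.c > local-changes.diff || true

# Steps 4 and 5: start from the 2a file and reapply the local changes.
cp upstream-2a.c rebased.c
patch -s rebased.c < local-changes.diff

# The result carries the 2a header and the local code change.
cat rebased.c
```

The local change sits more than three lines away from the header comment, so
the hunk context never includes the comment and the diff applies cleanly to
the 2a file. In the real tree the comments are spread through the code, which
is why step 7) produced comment rejects.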
Cc: Andreas Färber
Cc: Aurelien Jarno
Cc: Avi Kivity
Cc: Ben Taylor
Cc: Blue Swirl
Cc: Christophe Lyon
Cc: Fabrice Bellard
Cc: Guan Xuetao
Cc: Jocelyn Mayer
Cc: Juan Quintela
Cc: malc
Cc: Max Filippov
Cc: Paolo Bonzini
Cc: Paul Brook
Cc: Peter Maydell
Cc: Richard Henderson
Cc: Richard Sandiford
Cc: Stefan Weil
Cc: Thiemo Seufer
Signed-off-by: Anthony Liguori
Acked-by: Richard Henderson
Acked-by: Paolo Bonzini
Acked-by: Max Filippov
Acked-by: Juan Quintela
Acked-by: Stefan Weil
Acked-by: Richard Sandiford
Acked-by: Avi Kivity
Acked-by: Your Name bentaylor.solx86@gmail.com if needed. It looked like
Acked-by: Guan Xuetao
Acked-by: Aurelien Jarno
Acked-by: Andreas Färber
Acked-by: Blue Swirl
Acked-by: Paul Brook
Acked-by: Christophe Lyon
Acked-by: Peter Maydell
---
In order to make this change, we need to relicense all contributions from the
initial import of the SoftFloat code to match the license of SoftFloat-2a
(instead of the implied SoftFloat-2b license).

If you are on CC, it is because you have contributed to the softfloat code in
QEMU. Please respond to this note with:

Acked-by: Your Name

to signify that you are able and willing to relicense your changes to the
SoftFloat-2a license (or a GPL compatible license).

Please respond no later than May 6th, 2013. If we are unable to confirm
relicense from an author, changes from that author will be reverted.

---
For completeness, here is the full listing of contributions:

Andreas Färber
    be45f06 Silence softfloat warnings on OpenSolaris
    5aea4c5 softfloat: Replace uint16 type with uint_fast16_t
    94a49d8 softfloat: Replace int16 type with int_fast16_t
    c969654 softfloat: Fix mixups of int and int16
    38641f8 softfloat: Use uint16 consistently
    87b8cc3 softfloat: Resolve type mismatches between declaration and implementation
    8d725fa softfloat: Prepend QEMU-style header with derivation notice
    9f8d2a0 softfloat: Use uint32 consistently
    bb98fe4 softfloat: Drop [s]bits{8, 16, 32, 64} types in favor of [u]int{8, 16, 32, 64}_t

Aurelien Jarno
    1020160 softfloat: fix default-NaN mode
    084d19b target-mips: Implement correct NaN propagation rules
    196cfc8 softfloat: add a 1.0 constant for float32 and float64
    1b2ad2e softfloat-native: fix *nan()
    1f398e0 softfloat: use float{32,64,x80,128}_maybe_silence_nan()
    211315f softfloat: rename float*_eq() into float*_eq_quiet()
    2657d0f softfloat: rename float*_eq_signaling() into float*_eq()
    30e7a22 Use float_relation_* constants
    326b9e9 softfloat: fix float*_scalnb() corner cases
    34d2386 softfloat: remove HPPA specific code
    374dfc3 soft-float: add float32_log2() and float64_log2()
    4cc5383 softfloat-native: add float*_is_any_nan() functions
    587eabf softfloat: add float*_is_zero_or_denormal()
    629bd74 softfloat-native: add float32_is_nan()
    67b7861 softfloat: add float*_unordered_{,quiet}() functions
    8229c99 softfloat: add float32_exp2()
    85016c9 Assortment of soft-float fixes, by Aurelien Jarno.
    8d6c92b softfloat-native: improve correctness of floatXX_is_neg()
    93ae1c6 softfloat: fix float{32,64}_maybe_silence_nan() for MIPS
    a167ba5 Add support for GNU/kFreeBSD
    b3b4c7f softfloat: use GCC builtins to count the leading zeros
    b4a0ef7 softfloat-native: add float*_unordered_quiet() functions
    b689362 softfloat: move float*_eq and float*_eq_quiet
    b76235e softfloat: fix floatx80_is_infinity()
    bbc1ded softfloat: implement fused multiply-add NaN propagation for MIPS
    be22a9a softfloat: always enable floatx80 and float128 support
    c4b4c77 softfloat: add pi constants
    c52ab6f fp: add floatXX_is_infinity(), floatXX_is_neg(), floatXX_is_zero()
    cf67c6b softfloat-native: remove
    d2b1027 softfloat-native: add a few constant values
    d6882cf softfloat-native: fix float*_scalbn() functions
    d735d69 softfloat: rename *IsNaN variables to *IsQuietNaN
    dadd71a fp: fix float32_is_infinity()
    de4af5f softfloat: fix floatx80_is_{quiet,signaling}_nan()
    e024e88 target-ppc: Implement correct NaN propagation rules
    e2f4220 softfloat: fix floatx80 handling of NaN
    e872aa8 softfloat-native: fix type of float_rounding_mode
    e908775 softfloat: SH4 has the sNaN bit set
    f3218a8 softfloat: add floatx80 constants
    f5a6425 softfloat: improve description of comparison functions
    f6714d3 softfloat: add floatx80_compare*() functions
    f6a7d92 softfloat: add float{x80,128}_maybe_silence_nan()

Avi Kivity
    3bf7e40 softfloat: fix for C99

Ben Taylor
    0475a5c Solaris 9/x86 support, by Ben Taylor.
    c94655b Updated Solaris isinf support, by Juergen Keil and Ben Taylor.

Blue Swirl
    128ab2f Preliminary OpenBSD host support (based on OpenBSD patches by Todd T. Fries)
    14d483e Fix OpenSolaris softfloat warnings
    179a2c1 Rename _BSD to HOST_BSD so that it's more obvious that it's defined by configure
    1d6198c Remove unnecessary trailing newlines
    1f58732 128-bit float support for user mode
    2734c70 Rename one more _BSD to HOST_BSD (spotted by Hasso Tepper)
    3f4cb3d Fix OpenSolaris gcc4 warnings: iovec type mismatches, missing 'static'
    70c1470 Sparse fixes: dubious mixing of bitwise and logical operations
    7c2a9d0 Fix math warnings on OpenBSD -current
    b1d8e52 Fix undeclared symbol warnings from sparse
    b55266b Suppress gcc 4.x -Wpointer-sign (included in -Wall) warnings
    cd8a253 Fix more typos in softloat code (Eduardo Felipe)
    d07cca0 Add native softfloat fpu functions (Christoph Egger)
    ed086f3 softfloat: remove dead assignments, spotted by clang

Christophe Lyon
    8559666 softfloat: move all default NaN definitions to softfloat.h.
    bcd4d9a softfloat: Honour default_nan_mode for float-to-float conversions
    c30fe7d softfloat: add _set_sign(), _infinity and _half for 32 and 64 bits floats.

Fabrice Bellard
    158142c soft float support
    1b2b0af 64 bit fix
    1d6bda3 added abs, chs and compare functions
    38cfa06 Solaris port (Ben Taylor)
    750afe9 avoid using char when it is not necessary
    b109f9f more native FPU comparison functions - native FPU remainder
    ec530c8 Solaris port (Ben Taylor)
    fdbb469 Solaris/SPARC host port (Ben Taylor)

Guan Xuetao
    d2fbca9 unicore32: necessary modifications for other files to support unicore32

Jocelyn Mayer
    3430b0b Ooops... Typo.
    75d62a5 Add missing softfloat helpers.

Juan Quintela
    0eb4fc8 softfloat: make USE_SOFTFLOAT_STRUCT_TYPES compile
    71e72a1 rename HOST_BSD to CONFIG_BSD
    75b5a69 rename NEEDS_LIBSUNMATH to CONFIG_NEEDS_LIBSUNMATH
    dfe5fff change HOST_SOLARIS to CONFIG_SOLARIS{_VERSION}
    e2542fe rename WORDS_BIGENDIAN to HOST_WORDS_BIGENDIAN

malc
    947f5fc Add static qualifier to local functions
    e58ffeb Remove all traces of __powerpc__

Max Filippov
    6617680 softfloat: make float_muladd_negate_* flags independent
    213ff4e softfloat: add NO_SIGNALING_NANS
    b81fe82 target-xtensa: specialize softfloat NaN rules

Paolo Bonzini
    1de7afc misc: move include files to include/qemu/
    6b4c305 fpu: move public header file to include/fpu
    789ec7c softfloat: change default nan definitions to variables

Paul Brook
    6001149 ARM FP16 support
    6939754 Correctly normalize values and handle zero inputs to scalbn functions.
    3598ecb Remove missing include.
    5c7908e Implement default-NaN mode.
    7918bf4 Fix typo in BSD FP rounding mode names.
    9027db8 Fix ARM default NaN.
    9ee6e8b ARMv7 support.
    a1b91bb Fix typo in softfloat code.
    e6e5906 ColdFire target.
    f090c9d Add strict checking mode for softfp code.
    fe76d97 Implement flush-to-zero mode (denormal results are replaced with zero).

Peter Maydell
    1856987 softfloat: Rename float*_is_nan() functions to float*_is_quiet_nan()
    760e141 softfloat: roundAndPackInt{32, 64}: Don't assume int32 is 32 bits
    011da61 target-arm: Implement correct NaN propagation rules
    21d6ebd softfloat: Add float*_is_any_nan() functions
    274f1b0 softfloat: Add float*_min() and float*_max() functions
    2ac8bd0 softfloat: Reinstate accidentally disabled target-specific NaN handling
    2bed652 softfloat: Implement floatx80_is_any_nan() and float128_is_any_nan()
    354f211 softfloat: abstract out target-specific NaN propagation rules
    369be8f softfloat: Implement fused multiply-add
    37d1866 softfloat: Implement flushing input denormals to zero
    4be8eea fpu/softfloat.c: Remove pointless shift of always-zero value
    600e30d softfloat: Fix single-to-half precision float conversions
    6f3300a softfloat: Add float32_is_zero_or_denormal() function
    b3a6a2e softfloat: float*_to_int32_round_to_zero: don't assume int32 is 32 bits
    b408dbd softfloat: Add float*_maybe_silence_nan() functions
    bb4d4bb softfloat: Add float16 type and float16 NaN handling functions
    c29aca4 softfloat: Add setter function for tininess detection mode
    cbcef45 softfloat: Add float/double to 16 bit integer conversion functions
    d5138cf softfloat: Fix compilation failures with USE_SOFTFLOAT_STRUCT_TYPES
    e3d142d fpu: Correct edgecase in float64_muladd
    e6afc87 softfloat: Add new flag for when denormal result is flushed to zero
    e744c06 fpu/softfloat.c: Return correctly signed values from uint64_to_float32
    f591e1b softfloat: Correctly handle NaNs in float16_to_float32()

Richard Henderson
    17ed229 softfloat: Fix uint64_to_float64
    1e397ea softfloat: Implement uint64_to_float128
    8443eff target-alpha: Split up FPCR value into separate fields.
    990b3e1 target-alpha: Enable softfloat.
    ba0e276 target-alpha: Fixes for alpha-linux syscalls.

Richard Sandiford
    a6e7c18 softfloat: Handle float_muladd_negate_c when product is zero

Stefan Weil
    bc4347b arm host: fix compiler warning

Thiemo Seufer
    5a6932d Fix NaN handling for MIPS and HPPA.
    5fafdf2 find -type f | xargs sed -i 's/[\t ]$//g' # on most files
    63a654b trunc() for Solaris 9 / SPARC, by Juergen Keil.
    924b2c0 Add proper float*_is_nan prototypes.
    b645bb4 Fix softfloat NaN handling.
    fc81ba5 Check that HOST_SOLARIS is defined before relying on its value. Spotted by Joachim Henke.
---
 fpu/softfloat-macros.h     |  430 ++++----
 fpu/softfloat-specialize.h |  494 +++++----
 fpu/softfloat.c            | 2436 ++++++++++++++++++++++++--------------------
 include/fpu/softfloat.h    |  242 +++--
 4 files changed, 1981 insertions(+), 1621 deletions(-)

diff --git a/fpu/softfloat-macros.h b/fpu/softfloat-macros.h
index b5164af..2009315 100644
--- a/fpu/softfloat-macros.h
+++ b/fpu/softfloat-macros.h
@@ -4,10 +4,11 @@
  * Derived from SoftFloat.
  */
 
-/*============================================================================
+/*
+===============================================================================
 
 This C source fragment is part of the SoftFloat IEC/IEEE Floating-point
-Arithmetic Package, Release 2b.
+Arithmetic Package, Release 2a.
 
 Written by John R. Hauser. This work was made possible in part by the
 International Computer Science Institute, located at Suite 600, 1947 Center
@@ -16,28 +17,27 @@ National Science Foundation under grant MIP-9311980. The original version
 of this code was written as part of a project to build a fixed-point vector
 processor in collaboration with the University of California at Berkeley,
 overseen by Profs. Nelson Morgan and John Wawrzynek. More information
-is available through the Web page `http://www.cs.berkeley.edu/~jhauser/
+is available through the Web page `http://HTTP.CS.Berkeley.EDU/~jhauser/
 arithmetic/SoftFloat.html'.
 
-THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort has
-been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT TIMES
-RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO PERSONS
-AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ALL LOSSES,
-COSTS, OR OTHER PROBLEMS THEY INCUR DUE TO THE SOFTWARE, AND WHO FURTHERMORE
-EFFECTIVELY INDEMNIFY JOHN HAUSER AND THE INTERNATIONAL COMPUTER SCIENCE
-INSTITUTE (possibly via similar legal notice) AGAINST ALL LOSSES, COSTS, OR
-OTHER PROBLEMS INCURRED BY THEIR CUSTOMERS AND CLIENTS DUE TO THE SOFTWARE.
+THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort
+has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
+TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO
+PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
+AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.
 
 Derivative works are acceptable, even for commercial purposes, so long as
-(1) the source code for the derivative work includes prominent notice that
-the work is derivative, and (2) the source code includes prominent notice with
-these four paragraphs for those parts of this code that are retained.
+(1) they include prominent notice that the work is derivative, and (2) they
+include prominent notice akin to these four paragraphs for those parts of
+this code that are retained.
 
 =============================================================================*/
 
-/*----------------------------------------------------------------------------
-| This macro tests for minimum version of the GNU C compiler.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+This macro tests for minimum version of the GNU C compiler.
+-------------------------------------------------------------------------------
+*/
 #if defined(__GNUC__) && defined(__GNUC_MINOR__)
 # define SOFTFLOAT_GNUC_PREREQ(maj, min) \
          ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
@@ -46,14 +46,16 @@ these four paragraphs for those parts of this code that are retained.
 #endif
 
 
-/*----------------------------------------------------------------------------
-| Shifts `a' right by the number of bits given in `count'. If any nonzero
-| bits are shifted off, they are ``jammed'' into the least significant bit of
-| the result by setting the least significant bit to 1. The value of `count'
-| can be arbitrarily large; in particular, if `count' is greater than 32, the
-| result will be either 0 or 1, depending on whether `a' is zero or nonzero.
-| The result is stored in the location pointed to by `zPtr'.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Shifts `a' right by the number of bits given in `count'. If any nonzero
+bits are shifted off, they are ``jammed'' into the least significant bit of
+the result by setting the least significant bit to 1. The value of `count'
+can be arbitrarily large; in particular, if `count' is greater than 32, the
+result will be either 0 or 1, depending on whether `a' is zero or nonzero.
+The result is stored in the location pointed to by `zPtr'.
+-------------------------------------------------------------------------------
+*/
 
 INLINE void shift32RightJamming(uint32_t a, int_fast16_t count, uint32_t *zPtr)
 {
@@ -72,14 +74,16 @@ INLINE void shift32RightJamming(uint32_t a, int_fast16_t count, uint32_t *zPtr)
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts `a' right by the number of bits given in `count'. If any nonzero
-| bits are shifted off, they are ``jammed'' into the least significant bit of
-| the result by setting the least significant bit to 1. The value of `count'
-| can be arbitrarily large; in particular, if `count' is greater than 64, the
-| result will be either 0 or 1, depending on whether `a' is zero or nonzero.
-| The result is stored in the location pointed to by `zPtr'.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Shifts `a' right by the number of bits given in `count'. If any nonzero
+bits are shifted off, they are ``jammed'' into the least significant bit of
+the result by setting the least significant bit to 1. The value of `count'
+can be arbitrarily large; in particular, if `count' is greater than 64, the
+result will be either 0 or 1, depending on whether `a' is zero or nonzero.
+The result is stored in the location pointed to by `zPtr'.
+-------------------------------------------------------------------------------
+*/
 
 INLINE void shift64RightJamming(uint64_t a, int_fast16_t count, uint64_t *zPtr)
 {
@@ -98,23 +102,24 @@ INLINE void shift64RightJamming(uint64_t a, int_fast16_t count, uint64_t *zPtr)
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts the 128-bit value formed by concatenating `a0' and `a1' right by 64
-| _plus_ the number of bits given in `count'. The shifted result is at most
-| 64 nonzero bits; this is stored at the location pointed to by `z0Ptr'. The
-| bits shifted off form a second 64-bit result as follows: The _last_ bit
-| shifted off is the most-significant bit of the extra result, and the other
-| 63 bits of the extra result are all zero if and only if _all_but_the_last_
-| bits shifted off were all zero. This extra result is stored in the location
-| pointed to by `z1Ptr'. The value of `count' can be arbitrarily large.
-|     (This routine makes more sense if `a0' and `a1' are considered to form
-| a fixed-point value with binary point between `a0' and `a1'. This fixed-
-| point value is shifted right by the number of bits given in `count', and
-| the integer part of the result is returned at the location pointed to by
-| `z0Ptr'. The fractional part of the result may be slightly corrupted as
-| described above, and is returned at the location pointed to by `z1Ptr'.)
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Shifts the 128-bit value formed by concatenating `a0' and `a1' right by 64
+_plus_ the number of bits given in `count'. The shifted result is at most
+64 nonzero bits; this is stored at the location pointed to by `z0Ptr'. The
+bits shifted off form a second 64-bit result as follows: The _last_ bit
+shifted off is the most-significant bit of the extra result, and the other
+63 bits of the extra result are all zero if and only if _all_but_the_last_
+bits shifted off were all zero. This extra result is stored in the location
+pointed to by `z1Ptr'. The value of `count' can be arbitrarily large.
+    (This routine makes more sense if `a0' and `a1' are considered to form a
+fixed-point value with binary point between `a0' and `a1'. This fixed-point
+value is shifted right by the number of bits given in `count', and the
+integer part of the result is returned at the location pointed to by
+`z0Ptr'. The fractional part of the result may be slightly corrupted as
+described above, and is returned at the location pointed to by `z1Ptr'.)
+-------------------------------------------------------------------------------
+*/
 INLINE void
  shift64ExtraRightJamming(
      uint64_t a0, uint64_t a1, int_fast16_t count, uint64_t *z0Ptr, uint64_t *z1Ptr)
@@ -144,14 +149,15 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts the 128-bit value formed by concatenating `a0' and `a1' right by the
-| number of bits given in `count'. Any bits shifted off are lost. The value
-| of `count' can be arbitrarily large; in particular, if `count' is greater
-| than 128, the result will be 0. The result is broken into two 64-bit pieces
-| which are stored at the locations pointed to by `z0Ptr' and `z1Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Shifts the 128-bit value formed by concatenating `a0' and `a1' right by the
+number of bits given in `count'. Any bits shifted off are lost. The value
+of `count' can be arbitrarily large; in particular, if `count' is greater
+than 128, the result will be 0. The result is broken into two 64-bit pieces
+which are stored at the locations pointed to by `z0Ptr' and `z1Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  shift128Right(
      uint64_t a0, uint64_t a1, int_fast16_t count, uint64_t *z0Ptr, uint64_t *z1Ptr)
@@ -176,17 +182,18 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts the 128-bit value formed by concatenating `a0' and `a1' right by the
-| number of bits given in `count'. If any nonzero bits are shifted off, they
-| are ``jammed'' into the least significant bit of the result by setting the
-| least significant bit to 1. The value of `count' can be arbitrarily large;
-| in particular, if `count' is greater than 128, the result will be either
-| 0 or 1, depending on whether the concatenation of `a0' and `a1' is zero or
-| nonzero. The result is broken into two 64-bit pieces which are stored at
-| the locations pointed to by `z0Ptr' and `z1Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Shifts the 128-bit value formed by concatenating `a0' and `a1' right by the
+number of bits given in `count'. If any nonzero bits are shifted off, they
+are ``jammed'' into the least significant bit of the result by setting the
+least significant bit to 1. The value of `count' can be arbitrarily large;
+in particular, if `count' is greater than 128, the result will be either
+0 or 1, depending on whether the concatenation of `a0' and `a1' is zero or
+nonzero. The result is broken into two 64-bit pieces which are stored at
+the locations pointed to by `z0Ptr' and `z1Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  shift128RightJamming(
      uint64_t a0, uint64_t a1, int_fast16_t count, uint64_t *z0Ptr, uint64_t *z1Ptr)
@@ -219,25 +226,26 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts the 192-bit value formed by concatenating `a0', `a1', and `a2' right
-| by 64 _plus_ the number of bits given in `count'. The shifted result is
-| at most 128 nonzero bits; these are broken into two 64-bit pieces which are
-| stored at the locations pointed to by `z0Ptr' and `z1Ptr'. The bits shifted
-| off form a third 64-bit result as follows: The _last_ bit shifted off is
-| the most-significant bit of the extra result, and the other 63 bits of the
-| extra result are all zero if and only if _all_but_the_last_ bits shifted off
-| were all zero. This extra result is stored in the location pointed to by
-| `z2Ptr'. The value of `count' can be arbitrarily large.
-|     (This routine makes more sense if `a0', `a1', and `a2' are considered
-| to form a fixed-point value with binary point between `a1' and `a2'. This
-| fixed-point value is shifted right by the number of bits given in `count',
-| and the integer part of the result is returned at the locations pointed to
-| by `z0Ptr' and `z1Ptr'. The fractional part of the result may be slightly
-| corrupted as described above, and is returned at the location pointed to by
-| `z2Ptr'.)
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Shifts the 192-bit value formed by concatenating `a0', `a1', and `a2' right
+by 64 _plus_ the number of bits given in `count'. The shifted result is
+at most 128 nonzero bits; these are broken into two 64-bit pieces which are
+stored at the locations pointed to by `z0Ptr' and `z1Ptr'. The bits shifted
+off form a third 64-bit result as follows: The _last_ bit shifted off is
+the most-significant bit of the extra result, and the other 63 bits of the
+extra result are all zero if and only if _all_but_the_last_ bits shifted off
+were all zero. This extra result is stored in the location pointed to by
+`z2Ptr'. The value of `count' can be arbitrarily large.
+    (This routine makes more sense if `a0', `a1', and `a2' are considered
+to form a fixed-point value with binary point between `a1' and `a2'. This
+fixed-point value is shifted right by the number of bits given in `count',
+and the integer part of the result is returned at the locations pointed to
+by `z0Ptr' and `z1Ptr'. The fractional part of the result may be slightly
+corrupted as described above, and is returned at the location pointed to by
+`z2Ptr'.)
+-------------------------------------------------------------------------------
+*/
 INLINE void
  shift128ExtraRightJamming(
      uint64_t a0,
@@ -289,13 +297,14 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts the 128-bit value formed by concatenating `a0' and `a1' left by the
-| number of bits given in `count'. Any bits shifted off are lost. The value
-| of `count' must be less than 64. The result is broken into two 64-bit
-| pieces which are stored at the locations pointed to by `z0Ptr' and `z1Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Shifts the 128-bit value formed by concatenating `a0' and `a1' left by the
+number of bits given in `count'. Any bits shifted off are lost. The value
+of `count' must be less than 64. The result is broken into two 64-bit
+pieces which are stored at the locations pointed to by `z0Ptr' and `z1Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  shortShift128Left(
      uint64_t a0, uint64_t a1, int_fast16_t count, uint64_t *z0Ptr, uint64_t *z1Ptr)
@@ -307,14 +316,15 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Shifts the 192-bit value formed by concatenating `a0', `a1', and `a2' left
-| by the number of bits given in `count'. Any bits shifted off are lost.
-| The value of `count' must be less than 64. The result is broken into three
-| 64-bit pieces which are stored at the locations pointed to by `z0Ptr',
-| `z1Ptr', and `z2Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Shifts the 192-bit value formed by concatenating `a0', `a1', and `a2' left
+by the number of bits given in `count'. Any bits shifted off are lost.
+The value of `count' must be less than 64. The result is broken into three
+64-bit pieces which are stored at the locations pointed to by `z0Ptr',
+`z1Ptr', and `z2Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  shortShift192Left(
      uint64_t a0,
@@ -343,13 +353,14 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Adds the 128-bit value formed by concatenating `a0' and `a1' to the 128-bit
-| value formed by concatenating `b0' and `b1'. Addition is modulo 2^128, so
-| any carry out is lost. The result is broken into two 64-bit pieces which
-| are stored at the locations pointed to by `z0Ptr' and `z1Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Adds the 128-bit value formed by concatenating `a0' and `a1' to the 128-bit
+value formed by concatenating `b0' and `b1'. Addition is modulo 2^128, so
+any carry out is lost. The result is broken into two 64-bit pieces which
+are stored at the locations pointed to by `z0Ptr' and `z1Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  add128(
      uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t *z0Ptr, uint64_t *z1Ptr )
@@ -362,14 +373,15 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Adds the 192-bit value formed by concatenating `a0', `a1', and `a2' to the
-| 192-bit value formed by concatenating `b0', `b1', and `b2'. Addition is
-| modulo 2^192, so any carry out is lost. The result is broken into three
-| 64-bit pieces which are stored at the locations pointed to by `z0Ptr',
-| `z1Ptr', and `z2Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Adds the 192-bit value formed by concatenating `a0', `a1', and `a2' to the
+192-bit value formed by concatenating `b0', `b1', and `b2'. Addition is
+modulo 2^192, so any carry out is lost. The result is broken into three
+64-bit pieces which are stored at the locations pointed to by `z0Ptr',
+`z1Ptr', and `z2Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  add192(
      uint64_t a0,
@@ -400,14 +412,15 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Subtracts the 128-bit value formed by concatenating `b0' and `b1' from the
-| 128-bit value formed by concatenating `a0' and `a1'. Subtraction is modulo
-| 2^128, so any borrow out (carry out) is lost. The result is broken into two
-| 64-bit pieces which are stored at the locations pointed to by `z0Ptr' and
-| `z1Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Subtracts the 128-bit value formed by concatenating `b0' and `b1' from the
+128-bit value formed by concatenating `a0' and `a1'. Subtraction is modulo
+2^128, so any borrow out (carry out) is lost. The result is broken into two
+64-bit pieces which are stored at the locations pointed to by `z0Ptr' and
+`z1Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  sub128(
      uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t *z0Ptr, uint64_t *z1Ptr )
@@ -418,14 +431,15 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Subtracts the 192-bit value formed by concatenating `b0', `b1', and `b2'
-| from the 192-bit value formed by concatenating `a0', `a1', and `a2'.
-| Subtraction is modulo 2^192, so any borrow out (carry out) is lost. The
-| result is broken into three 64-bit pieces which are stored at the locations
-| pointed to by `z0Ptr', `z1Ptr', and `z2Ptr'.
-*----------------------------------------------------------------------------*/
-
+/*
+-------------------------------------------------------------------------------
+Subtracts the 192-bit value formed by concatenating `b0', `b1', and `b2'
+from the 192-bit value formed by concatenating `a0', `a1', and `a2'.
+Subtraction is modulo 2^192, so any borrow out (carry out) is lost. The
+result is broken into three 64-bit pieces which are stored at the locations
+pointed to by `z0Ptr', `z1Ptr', and `z2Ptr'.
+-------------------------------------------------------------------------------
+*/
 INLINE void
  sub192(
      uint64_t a0,
@@ -456,11 +470,13 @@ INLINE void
 
 }
 
-/*----------------------------------------------------------------------------
-| Multiplies `a' by `b' to obtain a 128-bit product. The product is broken
-| into two 64-bit pieces which are stored at the locations pointed to by
-| `z0Ptr' and `z1Ptr'.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Multiplies `a' by `b' to obtain a 128-bit product. The product is broken
+into two 64-bit pieces which are stored at the locations pointed to by
+`z0Ptr' and `z1Ptr'.
+-------------------------------------------------------------------------------
+*/
 
 INLINE void mul64To128( uint64_t a, uint64_t b, uint64_t *z0Ptr, uint64_t *z1Ptr )
 {
@@ -485,13 +501,14 @@ INLINE void mul64To128( uint64_t a, uint64_t b, uint64_t *z0Ptr, uint64_t *z1Ptr
 
 }
 
-/*----------------------------------------------------------------------------
-| Multiplies the 128-bit value formed by concatenating `a0' and `a1' by
-| `b' to obtain a 192-bit product.
The product is broken into three 64-bit -| pieces which are stored at the locations pointed to by `z0Ptr', `z1Ptr', and -| `z2Ptr'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Multiplies the 128-bit value formed by concatenating `a0' and `a1' by +`b' to obtain a 192-bit product. The product is broken into three 64-bit +pieces which are stored at the locations pointed to by `z0Ptr', `z1Ptr', and +`z2Ptr'. +------------------------------------------------------------------------------- +*/ INLINE void mul128By64To192( uint64_t a0, @@ -513,13 +530,14 @@ INLINE void } -/*---------------------------------------------------------------------------- -| Multiplies the 128-bit value formed by concatenating `a0' and `a1' to the -| 128-bit value formed by concatenating `b0' and `b1' to obtain a 256-bit -| product. The product is broken into four 64-bit pieces which are stored at -| the locations pointed to by `z0Ptr', `z1Ptr', `z2Ptr', and `z3Ptr'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Multiplies the 128-bit value formed by concatenating `a0' and `a1' to the +128-bit value formed by concatenating `b0' and `b1' to obtain a 256-bit +product. The product is broken into four 64-bit pieces which are stored at +the locations pointed to by `z0Ptr', `z1Ptr', `z2Ptr', and `z3Ptr'. +------------------------------------------------------------------------------- +*/ INLINE void mul128To256( uint64_t a0, @@ -550,14 +568,16 @@ INLINE void } -/*---------------------------------------------------------------------------- -| Returns an approximation to the 64-bit integer quotient obtained by dividing -| `b' into the 128-bit value formed by concatenating `a0' and `a1'. The -| divisor `b' must be at least 2^63. 
If q is the exact quotient truncated -| toward zero, the approximation returned lies between q and q + 2 inclusive. -| If the exact quotient q is larger than 64 bits, the maximum positive 64-bit -| unsigned integer is returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns an approximation to the 64-bit integer quotient obtained by dividing +`b' into the 128-bit value formed by concatenating `a0' and `a1'. The +divisor `b' must be at least 2^63. If q is the exact quotient truncated +toward zero, the approximation returned lies between q and q + 2 inclusive. +If the exact quotient q is larger than 64 bits, the maximum positive 64-bit +unsigned integer is returned. +------------------------------------------------------------------------------- +*/ static uint64_t estimateDiv128To64( uint64_t a0, uint64_t a1, uint64_t b ) { @@ -581,15 +601,17 @@ static uint64_t estimateDiv128To64( uint64_t a0, uint64_t a1, uint64_t b ) } -/*---------------------------------------------------------------------------- -| Returns an approximation to the square root of the 32-bit significand given -| by `a'. Considered as an integer, `a' must be at least 2^31. If bit 0 of -| `aExp' (the least significant bit) is 1, the integer returned approximates -| 2^31*sqrt(`a'/2^31), where `a' is considered an integer. If bit 0 of `aExp' -| is 0, the integer returned approximates 2^31*sqrt(`a'/2^30). In either -| case, the approximation returned lies strictly within +/-2 of the exact -| value. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns an approximation to the square root of the 32-bit significand given +by `a'. Considered as an integer, `a' must be at least 2^31. 
If bit 0 of +`aExp' (the least significant bit) is 1, the integer returned approximates +2^31*sqrt(`a'/2^31), where `a' is considered an integer. If bit 0 of `aExp' +is 0, the integer returned approximates 2^31*sqrt(`a'/2^30). In either +case, the approximation returned lies strictly within +/-2 of the exact +value. +------------------------------------------------------------------------------- +*/ static uint32_t estimateSqrt32(int_fast16_t aExp, uint32_t a) { @@ -620,10 +642,12 @@ static uint32_t estimateSqrt32(int_fast16_t aExp, uint32_t a) } -/*---------------------------------------------------------------------------- -| Returns the number of leading 0 bits before the most-significant 1 bit of -| `a'. If `a' is zero, 32 is returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the number of leading 0 bits before the most-significant 1 bit of +`a'. If `a' is zero, 32 is returned. +------------------------------------------------------------------------------- +*/ static int8 countLeadingZeros32( uint32_t a ) { @@ -668,10 +692,12 @@ static int8 countLeadingZeros32( uint32_t a ) #endif } -/*---------------------------------------------------------------------------- -| Returns the number of leading 0 bits before the most-significant 1 bit of -| `a'. If `a' is zero, 64 is returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the number of leading 0 bits before the most-significant 1 bit of +`a'. If `a' is zero, 64 is returned. 
+------------------------------------------------------------------------------- +*/ static int8 countLeadingZeros64( uint64_t a ) { @@ -696,11 +722,13 @@ static int8 countLeadingZeros64( uint64_t a ) #endif } -/*---------------------------------------------------------------------------- -| Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' -| is equal to the 128-bit value formed by concatenating `b0' and `b1'. -| Otherwise, returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' +is equal to the 128-bit value formed by concatenating `b0' and `b1'. +Otherwise, returns 0. +------------------------------------------------------------------------------- +*/ INLINE flag eq128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) { @@ -709,11 +737,13 @@ INLINE flag eq128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' is less -| than or equal to the 128-bit value formed by concatenating `b0' and `b1'. -| Otherwise, returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' is less +than or equal to the 128-bit value formed by concatenating `b0' and `b1'. +Otherwise, returns 0. 
+------------------------------------------------------------------------------- +*/ INLINE flag le128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) { @@ -722,11 +752,13 @@ INLINE flag le128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' is less -| than the 128-bit value formed by concatenating `b0' and `b1'. Otherwise, -| returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' is less +than the 128-bit value formed by concatenating `b0' and `b1'. Otherwise, +returns 0. +------------------------------------------------------------------------------- +*/ INLINE flag lt128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) { @@ -735,11 +767,13 @@ INLINE flag lt128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' is -| not equal to the 128-bit value formed by concatenating `b0' and `b1'. -| Otherwise, returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the 128-bit value formed by concatenating `a0' and `a1' is +not equal to the 128-bit value formed by concatenating `b0' and `b1'. +Otherwise, returns 0. 
+------------------------------------------------------------------------------- +*/ INLINE flag ne128( uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1 ) { diff --git a/fpu/softfloat-specialize.h b/fpu/softfloat-specialize.h index 518f694..ba9bfeb 100644 --- a/fpu/softfloat-specialize.h +++ b/fpu/softfloat-specialize.h @@ -4,10 +4,11 @@ * Derived from SoftFloat. */ -/*============================================================================ +/* +=============================================================================== This C source fragment is part of the SoftFloat IEC/IEEE Floating-point -Arithmetic Package, Release 2b. +Arithmetic Package, Release 2a. Written by John R. Hauser. This work was made possible in part by the International Computer Science Institute, located at Suite 600, 1947 Center @@ -16,22 +17,19 @@ National Science Foundation under grant MIP-9311980. The original version of this code was written as part of a project to build a fixed-point vector processor in collaboration with the University of California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek. More information -is available through the Web page `http://www.cs.berkeley.edu/~jhauser/ +is available through the Web page `http://HTTP.CS.Berkeley.EDU/~jhauser/ arithmetic/SoftFloat.html'. -THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort has -been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT TIMES -RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO PERSONS -AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ALL LOSSES, -COSTS, OR OTHER PROBLEMS THEY INCUR DUE TO THE SOFTWARE, AND WHO FURTHERMORE -EFFECTIVELY INDEMNIFY JOHN HAUSER AND THE INTERNATIONAL COMPUTER SCIENCE -INSTITUTE (possibly via similar legal warning) AGAINST ALL LOSSES, COSTS, OR -OTHER PROBLEMS INCURRED BY THEIR CUSTOMERS AND CLIENTS DUE TO THE SOFTWARE. +THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. 
Although reasonable effort +has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT +TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO +PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY +AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE. Derivative works are acceptable, even for commercial purposes, so long as -(1) the source code for the derivative work includes prominent notice that -the work is derivative, and (2) the source code includes prominent notice with -these four paragraphs for those parts of this code that are retained. +(1) they include prominent notice that the work is derivative, and (2) they +include prominent notice akin to these four paragraphs for those parts of +this code that are retained. =============================================================================*/ @@ -48,9 +46,11 @@ these four paragraphs for those parts of this code that are retained. #define NO_SIGNALING_NANS 1 #endif -/*---------------------------------------------------------------------------- -| The pattern for a default generated half-precision NaN. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +The pattern for a default generated half-precision NaN. +------------------------------------------------------------------------------- +*/ #if defined(TARGET_ARM) const float16 float16_default_nan = const_float16(0x7E00); #elif SNAN_BIT_IS_ONE @@ -59,9 +59,11 @@ const float16 float16_default_nan = const_float16(0x7DFF); const float16 float16_default_nan = const_float16(0xFE00); #endif -/*---------------------------------------------------------------------------- -| The pattern for a default generated single-precision NaN. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +The pattern for a default generated single-precision NaN. +------------------------------------------------------------------------------- +*/ #if defined(TARGET_SPARC) const float32 float32_default_nan = const_float32(0x7FFFFFFF); #elif defined(TARGET_PPC) || defined(TARGET_ARM) || defined(TARGET_ALPHA) || \ @@ -73,9 +75,11 @@ const float32 float32_default_nan = const_float32(0x7FBFFFFF); const float32 float32_default_nan = const_float32(0xFFC00000); #endif -/*---------------------------------------------------------------------------- -| The pattern for a default generated double-precision NaN. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +The pattern for a default generated double-precision NaN. +------------------------------------------------------------------------------- +*/ #if defined(TARGET_SPARC) const float64 float64_default_nan = const_float64(LIT64( 0x7FFFFFFFFFFFFFFF )); #elif defined(TARGET_PPC) || defined(TARGET_ARM) || defined(TARGET_ALPHA) @@ -86,9 +90,11 @@ const float64 float64_default_nan = const_float64(LIT64( 0x7FF7FFFFFFFFFFFF )); const float64 float64_default_nan = const_float64(LIT64( 0xFFF8000000000000 )); #endif -/*---------------------------------------------------------------------------- -| The pattern for a default generated extended double-precision NaN. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +The pattern for a default generated extended double-precision NaN. 
+------------------------------------------------------------------------------- +*/ #if SNAN_BIT_IS_ONE #define floatx80_default_nan_high 0x7FFF #define floatx80_default_nan_low LIT64( 0xBFFFFFFFFFFFFFFF ) @@ -100,10 +106,12 @@ const float64 float64_default_nan = const_float64(LIT64( 0xFFF8000000000000 )); const floatx80 floatx80_default_nan = make_floatx80_init(floatx80_default_nan_high, floatx80_default_nan_low); -/*---------------------------------------------------------------------------- -| The pattern for a default generated quadruple-precision NaN. The `high' and -| `low' values hold the most- and least-significant bits, respectively. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +The pattern for a default generated quadruple-precision NaN. The `high' and +`low' values hold the most- and least-significant bits, respectively. +------------------------------------------------------------------------------- +*/ #if SNAN_BIT_IS_ONE #define float128_default_nan_high LIT64( 0x7FFF7FFFFFFFFFFF ) #define float128_default_nan_low LIT64( 0xFFFFFFFFFFFFFFFF ) @@ -115,21 +123,25 @@ const floatx80 floatx80_default_nan const float128 float128_default_nan = make_float128_init(float128_default_nan_high, float128_default_nan_low); -/*---------------------------------------------------------------------------- -| Raises the exceptions specified by `flags'. Floating-point traps can be -| defined here if desired. It is currently not possible for such a trap -| to substitute a result value. If traps are not implemented, this routine -| should be simply `float_exception_flags |= flags;'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Raises the exceptions specified by `flags'. Floating-point traps can be +defined here if desired. 
It is currently not possible for such a trap +to substitute a result value. If traps are not implemented, this routine +should be simply `float_exception_flags |= flags;'. +------------------------------------------------------------------------------- +*/ void float_raise( int8 flags STATUS_PARAM ) { STATUS(float_exception_flags) |= flags; } -/*---------------------------------------------------------------------------- -| Internal canonical NaN format. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Internal canonical NaN format. +------------------------------------------------------------------------------- +*/ typedef struct { flag sign; uint64_t high, low; @@ -146,10 +158,12 @@ int float16_is_signaling_nan(float16 a_) return 0; } #else -/*---------------------------------------------------------------------------- -| Returns 1 if the half-precision floating-point value `a' is a quiet -| NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the half-precision floating-point value `a' is a quiet +NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float16_is_quiet_nan(float16 a_) { @@ -161,10 +175,12 @@ int float16_is_quiet_nan(float16 a_) #endif } -/*---------------------------------------------------------------------------- -| Returns 1 if the half-precision floating-point value `a' is a signaling -| NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the half-precision floating-point value `a' is a signaling +NaN; otherwise returns 0. 
+------------------------------------------------------------------------------- +*/ int float16_is_signaling_nan(float16 a_) { @@ -177,10 +193,12 @@ int float16_is_signaling_nan(float16 a_) } #endif -/*---------------------------------------------------------------------------- -| Returns a quiet NaN if the half-precision floating point value `a' is a -| signaling NaN; otherwise returns `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns a quiet NaN if the half-precision floating point value `a' is a +signaling NaN; otherwise returns `a'. +------------------------------------------------------------------------------- +*/ float16 float16_maybe_silence_nan(float16 a_) { if (float16_is_signaling_nan(a_)) { @@ -199,11 +217,13 @@ float16 float16_maybe_silence_nan(float16 a_) return a_; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the half-precision floating-point NaN -| `a' to the canonical NaN format. If `a' is a signaling NaN, the invalid -| exception is raised. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the half-precision floating-point NaN +`a' to the canonical NaN format. If `a' is a signaling NaN, the invalid +exception is raised. +------------------------------------------------------------------------------- +*/ static commonNaNT float16ToCommonNaN( float16 a STATUS_PARAM ) { @@ -216,10 +236,12 @@ static commonNaNT float16ToCommonNaN( float16 a STATUS_PARAM ) return z; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the canonical NaN `a' to the half- -| precision floating-point format. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the canonical NaN `a' to the half- +precision floating-point format. +------------------------------------------------------------------------------- +*/ static float16 commonNaNToFloat16(commonNaNT a STATUS_PARAM) { @@ -248,10 +270,12 @@ int float32_is_signaling_nan(float32 a_) return 0; } #else -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is a quiet -| NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is a quiet +NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float32_is_quiet_nan( float32 a_ ) { @@ -263,10 +287,12 @@ int float32_is_quiet_nan( float32 a_ ) #endif } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is a signaling -| NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is a signaling +NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float32_is_signaling_nan( float32 a_ ) { @@ -279,10 +305,12 @@ int float32_is_signaling_nan( float32 a_ ) } #endif -/*---------------------------------------------------------------------------- -| Returns a quiet NaN if the single-precision floating point value `a' is a -| signaling NaN; otherwise returns `a'. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns a quiet NaN if the single-precision floating point value `a' is a +signaling NaN; otherwise returns `a'. +------------------------------------------------------------------------------- +*/ float32 float32_maybe_silence_nan( float32 a_ ) { @@ -302,12 +330,13 @@ float32 float32_maybe_silence_nan( float32 a_ ) return a_; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point NaN -| `a' to the canonical NaN format. If `a' is a signaling NaN, the invalid -| exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point NaN +`a' to the canonical NaN format. If `a' is a signaling NaN, the invalid +exception is raised. +------------------------------------------------------------------------------- +*/ static commonNaNT float32ToCommonNaN( float32 a STATUS_PARAM ) { commonNaNT z; @@ -319,10 +348,12 @@ static commonNaNT float32ToCommonNaN( float32 a STATUS_PARAM ) return z; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the canonical NaN `a' to the single- -| precision floating-point format. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the canonical NaN `a' to the single- +precision floating-point format. 
+------------------------------------------------------------------------------- +*/ static float32 commonNaNToFloat32( commonNaNT a STATUS_PARAM) { @@ -339,22 +370,24 @@ static float32 commonNaNToFloat32( commonNaNT a STATUS_PARAM) return float32_default_nan; } -/*---------------------------------------------------------------------------- -| Select which NaN to propagate for a two-input operation. -| IEEE754 doesn't specify all the details of this, so the -| algorithm is target-specific. -| The routine is passed various bits of information about the -| two NaNs and should return 0 to select NaN a and 1 for NaN b. -| Note that signalling NaNs are always squashed to quiet NaNs -| by the caller, by calling floatXX_maybe_silence_nan() before -| returning them. -| -| aIsLargerSignificand is only valid if both a and b are NaNs -| of some kind, and is true if a has the larger significand, -| or if both a and b have the same significand but a is -| positive but b is negative. It is only needed for the x87 -| tie-break rule. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Select which NaN to propagate for a two-input operation. +IEEE754 doesn't specify all the details of this, so the +algorithm is target-specific. +The routine is passed various bits of information about the +two NaNs and should return 0 to select NaN a and 1 for NaN b. +Note that signalling NaNs are always squashed to quiet NaNs +by the caller, by calling floatXX_maybe_silence_nan() before +returning them. + +aIsLargerSignificand is only valid if both a and b are NaNs +of some kind, and is true if a has the larger significand, +or if both a and b have the same significand but a is +positive but b is negative. It is only needed for the x87 +tie-break rule. 
+------------------------------------------------------------------------------- +*/ #if defined(TARGET_ARM) static int pickNaN(flag aIsQNaN, flag aIsSNaN, flag bIsQNaN, flag bIsSNaN, @@ -451,12 +484,14 @@ static int pickNaN(flag aIsQNaN, flag aIsSNaN, flag bIsQNaN, flag bIsSNaN, } #endif -/*---------------------------------------------------------------------------- -| Select which NaN to propagate for a three-input operation. -| For the moment we assume that no CPU needs the 'larger significand' -| information. -| Return values : 0 : a; 1 : b; 2 : c; 3 : default-NaN -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Select which NaN to propagate for a three-input operation. +For the moment we assume that no CPU needs the 'larger significand' +information. +Return values : 0 : a; 1 : b; 2 : c; 3 : default-NaN +------------------------------------------------------------------------------- +*/ #if defined(TARGET_ARM) static int pickNaNMulAdd(flag aIsQNaN, flag aIsSNaN, flag bIsQNaN, flag bIsSNaN, flag cIsQNaN, flag cIsSNaN, flag infzero STATUS_PARAM) @@ -554,12 +589,13 @@ static int pickNaNMulAdd(flag aIsQNaN, flag aIsSNaN, flag bIsQNaN, flag bIsSNaN, } #endif -/*---------------------------------------------------------------------------- -| Takes two single-precision floating-point values `a' and `b', one of which -| is a NaN, and returns the appropriate NaN result. If either `a' or `b' is a -| signaling NaN, the invalid exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes two single-precision floating-point values `a' and `b', one of which +is a NaN, and returns the appropriate NaN result. If either `a' or `b' is a +signaling NaN, the invalid exception is raised. 
+------------------------------------------------------------------------------- +*/ static float32 propagateFloat32NaN( float32 a, float32 b STATUS_PARAM) { flag aIsQuietNaN, aIsSignalingNaN, bIsQuietNaN, bIsSignalingNaN; @@ -594,14 +630,16 @@ static float32 propagateFloat32NaN( float32 a, float32 b STATUS_PARAM) } } -/*---------------------------------------------------------------------------- -| Takes three single-precision floating-point values `a', `b' and `c', one of -| which is a NaN, and returns the appropriate NaN result. If any of `a', -| `b' or `c' is a signaling NaN, the invalid exception is raised. -| The input infzero indicates whether a*b was 0*inf or inf*0 (in which case -| obviously c is a NaN, and whether to propagate c or some other NaN is -| implementation defined). -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Takes three single-precision floating-point values `a', `b' and `c', one of +which is a NaN, and returns the appropriate NaN result. If any of `a', +`b' or `c' is a signaling NaN, the invalid exception is raised. +The input infzero indicates whether a*b was 0*inf or inf*0 (in which case +obviously c is a NaN, and whether to propagate c or some other NaN is +implementation defined). +------------------------------------------------------------------------------- +*/ static float32 propagateFloat32MulAddNaN(float32 a, float32 b, float32 c, flag infzero STATUS_PARAM) @@ -656,10 +694,12 @@ int float64_is_signaling_nan(float64 a_) return 0; } #else -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is a quiet -| NaN; otherwise returns 0. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is a quiet +NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float64_is_quiet_nan( float64 a_ ) { @@ -673,10 +713,12 @@ int float64_is_quiet_nan( float64 a_ ) #endif } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is a signaling -| NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is a signaling +NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float64_is_signaling_nan( float64 a_ ) { @@ -691,10 +733,12 @@ int float64_is_signaling_nan( float64 a_ ) } #endif -/*---------------------------------------------------------------------------- -| Returns a quiet NaN if the double-precision floating point value `a' is a -| signaling NaN; otherwise returns `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns a quiet NaN if the double-precision floating point value `a' is a +signaling NaN; otherwise returns `a'. 
+------------------------------------------------------------------------------- +*/ float64 float64_maybe_silence_nan( float64 a_ ) { @@ -714,12 +758,13 @@ float64 float64_maybe_silence_nan( float64 a_ ) return a_; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point NaN -| `a' to the canonical NaN format. If `a' is a signaling NaN, the invalid -| exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point NaN +`a' to the canonical NaN format. If `a' is a signaling NaN, the invalid +exception is raised. +------------------------------------------------------------------------------- +*/ static commonNaNT float64ToCommonNaN( float64 a STATUS_PARAM) { commonNaNT z; @@ -731,10 +776,12 @@ static commonNaNT float64ToCommonNaN( float64 a STATUS_PARAM) return z; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the canonical NaN `a' to the double- -| precision floating-point format. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the canonical NaN `a' to the double- +precision floating-point format. 
+------------------------------------------------------------------------------- +*/ static float64 commonNaNToFloat64( commonNaNT a STATUS_PARAM) { @@ -753,12 +800,13 @@ static float64 commonNaNToFloat64( commonNaNT a STATUS_PARAM) return float64_default_nan; } -/*---------------------------------------------------------------------------- -| Takes two double-precision floating-point values `a' and `b', one of which -| is a NaN, and returns the appropriate NaN result. If either `a' or `b' is a -| signaling NaN, the invalid exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes two double-precision floating-point values `a' and `b', one of which +is a NaN, and returns the appropriate NaN result. If either `a' or `b' is a +signaling NaN, the invalid exception is raised. +------------------------------------------------------------------------------- +*/ static float64 propagateFloat64NaN( float64 a, float64 b STATUS_PARAM) { flag aIsQuietNaN, aIsSignalingNaN, bIsQuietNaN, bIsSignalingNaN; @@ -793,14 +841,16 @@ static float64 propagateFloat64NaN( float64 a, float64 b STATUS_PARAM) } } -/*---------------------------------------------------------------------------- -| Takes three double-precision floating-point values `a', `b' and `c', one of -| which is a NaN, and returns the appropriate NaN result. If any of `a', -| `b' or `c' is a signaling NaN, the invalid exception is raised. -| The input infzero indicates whether a*b was 0*inf or inf*0 (in which case -| obviously c is a NaN, and whether to propagate c or some other NaN is -| implementation defined). 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Takes three double-precision floating-point values `a', `b' and `c', one of +which is a NaN, and returns the appropriate NaN result. If any of `a', +`b' or `c' is a signaling NaN, the invalid exception is raised. +The input infzero indicates whether a*b was 0*inf or inf*0 (in which case +obviously c is a NaN, and whether to propagate c or some other NaN is +implementation defined). +------------------------------------------------------------------------------- +*/ static float64 propagateFloat64MulAddNaN(float64 a, float64 b, float64 c, flag infzero STATUS_PARAM) @@ -855,11 +905,13 @@ int floatx80_is_signaling_nan(floatx80 a_) return 0; } #else -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is a -| quiet NaN; otherwise returns 0. This slightly differs from the same -| function for other types as floatx80 has an explicit bit. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is a +quiet NaN; otherwise returns 0. This slightly differs from the same +function for other types as floatx80 has an explicit bit. +------------------------------------------------------------------------------- +*/ int floatx80_is_quiet_nan( floatx80 a ) { @@ -877,11 +929,13 @@ int floatx80_is_quiet_nan( floatx80 a ) #endif } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is a -| signaling NaN; otherwise returns 0. This slightly differs from the same -| function for other types as floatx80 has an explicit bit. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is a +signaling NaN; otherwise returns 0. This slightly differs from the same +function for other types as floatx80 has an explicit bit. +------------------------------------------------------------------------------- +*/ int floatx80_is_signaling_nan( floatx80 a ) { @@ -900,10 +954,12 @@ int floatx80_is_signaling_nan( floatx80 a ) } #endif -/*---------------------------------------------------------------------------- -| Returns a quiet NaN if the extended double-precision floating point value -| `a' is a signaling NaN; otherwise returns `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns a quiet NaN if the extended double-precision floating point value +`a' is a signaling NaN; otherwise returns `a'. +------------------------------------------------------------------------------- +*/ floatx80 floatx80_maybe_silence_nan( floatx80 a ) { @@ -923,12 +979,13 @@ floatx80 floatx80_maybe_silence_nan( floatx80 a ) return a; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point NaN `a' to the canonical NaN format. If `a' is a signaling NaN, the -| invalid exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point NaN `a' to the canonical NaN format. If `a' is a signaling NaN, the +invalid exception is raised. 
+------------------------------------------------------------------------------- +*/ static commonNaNT floatx80ToCommonNaN( floatx80 a STATUS_PARAM) { commonNaNT z; @@ -946,10 +1003,12 @@ static commonNaNT floatx80ToCommonNaN( floatx80 a STATUS_PARAM) return z; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the canonical NaN `a' to the extended -| double-precision floating-point format. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the canonical NaN `a' to the extended +double-precision floating-point format. +------------------------------------------------------------------------------- +*/ static floatx80 commonNaNToFloatx80( commonNaNT a STATUS_PARAM) { @@ -972,12 +1031,13 @@ static floatx80 commonNaNToFloatx80( commonNaNT a STATUS_PARAM) return z; } -/*---------------------------------------------------------------------------- -| Takes two extended double-precision floating-point values `a' and `b', one -| of which is a NaN, and returns the appropriate NaN result. If either `a' or -| `b' is a signaling NaN, the invalid exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes two extended double-precision floating-point values `a' and `b', one +of which is a NaN, and returns the appropriate NaN result. If either `a' or +`b' is a signaling NaN, the invalid exception is raised. 
+------------------------------------------------------------------------------- +*/ static floatx80 propagateFloatx80NaN( floatx80 a, floatx80 b STATUS_PARAM) { flag aIsQuietNaN, aIsSignalingNaN, bIsQuietNaN, bIsSignalingNaN; @@ -1023,10 +1083,12 @@ int float128_is_signaling_nan(float128 a_) return 0; } #else -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is a quiet -| NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is a quiet +NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float128_is_quiet_nan( float128 a ) { @@ -1041,10 +1103,12 @@ int float128_is_quiet_nan( float128 a ) #endif } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is a -| signaling NaN; otherwise returns 0. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is a +signaling NaN; otherwise returns 0. +------------------------------------------------------------------------------- +*/ int float128_is_signaling_nan( float128 a ) { @@ -1060,10 +1124,12 @@ int float128_is_signaling_nan( float128 a ) } #endif -/*---------------------------------------------------------------------------- -| Returns a quiet NaN if the quadruple-precision floating point value `a' is -| a signaling NaN; otherwise returns `a'. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns a quiet NaN if the quadruple-precision floating point value `a' is +a signaling NaN; otherwise returns `a'. +------------------------------------------------------------------------------- +*/ float128 float128_maybe_silence_nan( float128 a ) { @@ -1083,12 +1149,13 @@ float128 float128_maybe_silence_nan( float128 a ) return a; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point NaN -| `a' to the canonical NaN format. If `a' is a signaling NaN, the invalid -| exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point NaN +`a' to the canonical NaN format. If `a' is a signaling NaN, the invalid +exception is raised. +------------------------------------------------------------------------------- +*/ static commonNaNT float128ToCommonNaN( float128 a STATUS_PARAM) { commonNaNT z; @@ -1099,10 +1166,12 @@ static commonNaNT float128ToCommonNaN( float128 a STATUS_PARAM) return z; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the canonical NaN `a' to the quadruple- -| precision floating-point format. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the canonical NaN `a' to the quadruple- +precision floating-point format. 
+------------------------------------------------------------------------------- +*/ static float128 commonNaNToFloat128( commonNaNT a STATUS_PARAM) { @@ -1119,12 +1188,13 @@ static float128 commonNaNToFloat128( commonNaNT a STATUS_PARAM) return z; } -/*---------------------------------------------------------------------------- -| Takes two quadruple-precision floating-point values `a' and `b', one of -| which is a NaN, and returns the appropriate NaN result. If either `a' or -| `b' is a signaling NaN, the invalid exception is raised. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes two quadruple-precision floating-point values `a' and `b', one of +which is a NaN, and returns the appropriate NaN result. If either `a' or +`b' is a signaling NaN, the invalid exception is raised. +------------------------------------------------------------------------------- +*/ static float128 propagateFloat128NaN( float128 a, float128 b STATUS_PARAM) { flag aIsQuietNaN, aIsSignalingNaN, bIsQuietNaN, bIsSignalingNaN; diff --git a/fpu/softfloat.c b/fpu/softfloat.c index 7ba51b6..9145582 100644 --- a/fpu/softfloat.c +++ b/fpu/softfloat.c @@ -4,10 +4,11 @@ * Derived from SoftFloat. */ -/*============================================================================ +/* +=============================================================================== -This C source file is part of the SoftFloat IEC/IEEE Floating-point Arithmetic -Package, Release 2b. +This C source file is part of the SoftFloat IEC/IEEE Floating-point +Arithmetic Package, Release 2a. Written by John R. Hauser. This work was made possible in part by the International Computer Science Institute, located at Suite 600, 1947 Center @@ -16,24 +17,22 @@ National Science Foundation under grant MIP-9311980. 
The original version of this code was written as part of a project to build a fixed-point vector processor in collaboration with the University of California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek. More information -is available through the Web page `http://www.cs.berkeley.edu/~jhauser/ +is available through the Web page `http://HTTP.CS.Berkeley.EDU/~jhauser/ arithmetic/SoftFloat.html'. -THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort has -been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT TIMES -RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO PERSONS -AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ALL LOSSES, -COSTS, OR OTHER PROBLEMS THEY INCUR DUE TO THE SOFTWARE, AND WHO FURTHERMORE -EFFECTIVELY INDEMNIFY JOHN HAUSER AND THE INTERNATIONAL COMPUTER SCIENCE -INSTITUTE (possibly via similar legal warning) AGAINST ALL LOSSES, COSTS, OR -OTHER PROBLEMS INCURRED BY THEIR CUSTOMERS AND CLIENTS DUE TO THE SOFTWARE. +THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort +has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT +TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO +PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY +AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE. Derivative works are acceptable, even for commercial purposes, so long as -(1) the source code for the derivative work includes prominent notice that -the work is derivative, and (2) the source code includes prominent notice with -these four paragraphs for those parts of this code that are retained. +(1) they include prominent notice that the work is derivative, and (2) they +include prominent notice akin to these four paragraphs for those parts of +this code that are retained. 
-=============================================================================*/ +=============================================================================== +*/ /* softfloat (and in particular the code in softfloat-specialize.h) is * target-dependent and needs the TARGET_* macros. @@ -42,21 +41,25 @@ these four paragraphs for those parts of this code that are retained. #include "fpu/softfloat.h" -/*---------------------------------------------------------------------------- -| Primitive arithmetic functions, including multi-word arithmetic, and -| division and square root approximations. (Can be specialized to target if -| desired.) -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Primitive arithmetic functions, including multi-word arithmetic, and +division and square root approximations. (Can be specialized to target if +desired.) +------------------------------------------------------------------------------- +*/ #include "softfloat-macros.h" -/*---------------------------------------------------------------------------- -| Functions and definitions to determine: (1) whether tininess for underflow -| is detected before or after rounding by default, (2) what (if anything) -| happens when exceptions are raised, (3) how signaling NaNs are distinguished -| from quiet NaNs, (4) the default generated quiet NaNs, and (5) how NaNs -| are propagated from function inputs to output. These details are target- -| specific. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Functions and definitions to determine: (1) whether tininess for underflow +is detected before or after rounding by default, (2) what (if anything) +happens when exceptions are raised, (3) how signaling NaNs are distinguished +from quiet NaNs, (4) the default generated quiet NaNs, and (5) how NaNs +are propagated from function inputs to output. These details are target- +specific. +------------------------------------------------------------------------------- +*/ #include "softfloat-specialize.h" void set_float_rounding_mode(int val STATUS_PARAM) @@ -74,43 +77,51 @@ void set_floatx80_rounding_precision(int val STATUS_PARAM) STATUS(floatx80_rounding_precision) = val; } -/*---------------------------------------------------------------------------- -| Returns the fraction bits of the half-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the fraction bits of the half-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE uint32_t extractFloat16Frac(float16 a) { return float16_val(a) & 0x3ff; } -/*---------------------------------------------------------------------------- -| Returns the exponent bits of the half-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the exponent bits of the half-precision floating-point value `a'. 
+------------------------------------------------------------------------------- +*/ INLINE int_fast16_t extractFloat16Exp(float16 a) { return (float16_val(a) >> 10) & 0x1f; } -/*---------------------------------------------------------------------------- -| Returns the sign bit of the single-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the sign bit of the half-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE flag extractFloat16Sign(float16 a) { return float16_val(a)>>15; } -/*---------------------------------------------------------------------------- -| Takes a 64-bit fixed-point value `absZ' with binary point between bits 6 -| and 7, and returns the properly rounded 32-bit integer corresponding to the -| input. If `zSign' is 1, the input is negated before being converted to an -| integer. Bit 63 of `absZ' must be zero. Ordinarily, the fixed-point input -| is simply rounded to an integer, with the inexact exception raised if the -| input cannot be represented exactly as an integer. However, if the fixed- -| point input is too large, the invalid exception is raised and the largest -| positive or negative integer is returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Takes a 64-bit fixed-point value `absZ' with binary point between bits 6 +and 7, and returns the properly rounded 32-bit integer corresponding to the +input. If `zSign' is 1, the input is negated before being converted to an +integer. Bit 63 of `absZ' must be zero. Ordinarily, the fixed-point input +is simply rounded to an integer, with the inexact exception raised if the +input cannot be represented exactly as an integer.
However, if the fixed- +point input is too large, the invalid exception is raised and the largest +positive or negative integer is returned. +------------------------------------------------------------------------------- +*/ static int32 roundAndPackInt32( flag zSign, uint64_t absZ STATUS_PARAM) { @@ -150,17 +161,19 @@ static int32 roundAndPackInt32( flag zSign, uint64_t absZ STATUS_PARAM) } -/*---------------------------------------------------------------------------- -| Takes the 128-bit fixed-point value formed by concatenating `absZ0' and -| `absZ1', with binary point between bits 63 and 64 (between the input words), -| and returns the properly rounded 64-bit integer corresponding to the input. -| If `zSign' is 1, the input is negated before being converted to an integer. -| Ordinarily, the fixed-point input is simply rounded to an integer, with -| the inexact exception raised if the input cannot be represented exactly as -| an integer. However, if the fixed-point input is too large, the invalid -| exception is raised and the largest positive or negative integer is -| returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Takes the 128-bit fixed-point value formed by concatenating `absZ0' and +`absZ1', with binary point between bits 63 and 64 (between the input words), +and returns the properly rounded 64-bit integer corresponding to the input. +If `zSign' is 1, the input is negated before being converted to an integer. +Ordinarily, the fixed-point input is simply rounded to an integer, with +the inexact exception raised if the input cannot be represented exactly as +an integer. However, if the fixed-point input is too large, the invalid +exception is raised and the largest positive or negative integer is +returned. 
+------------------------------------------------------------------------------- +*/ static int64 roundAndPackInt64( flag zSign, uint64_t absZ0, uint64_t absZ1 STATUS_PARAM) { @@ -203,9 +216,11 @@ static int64 roundAndPackInt64( flag zSign, uint64_t absZ0, uint64_t absZ1 STATU } -/*---------------------------------------------------------------------------- -| Returns the fraction bits of the single-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the fraction bits of the single-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE uint32_t extractFloat32Frac( float32 a ) { @@ -214,9 +229,11 @@ INLINE uint32_t extractFloat32Frac( float32 a ) } -/*---------------------------------------------------------------------------- -| Returns the exponent bits of the single-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the exponent bits of the single-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE int_fast16_t extractFloat32Exp(float32 a) { @@ -225,10 +242,11 @@ INLINE int_fast16_t extractFloat32Exp(float32 a) } -/*---------------------------------------------------------------------------- -| Returns the sign bit of the single-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the sign bit of the single-precision floating-point value `a'. 
+------------------------------------------------------------------------------- +*/ INLINE flag extractFloat32Sign( float32 a ) { @@ -236,10 +254,12 @@ INLINE flag extractFloat32Sign( float32 a ) } -/*---------------------------------------------------------------------------- -| If `a' is denormal and we are in flush-to-zero mode then set the -| input-denormal exception and return zero. Otherwise just return the value. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +If `a' is denormal and we are in flush-to-zero mode then set the +input-denormal exception and return zero. Otherwise just return the value. +------------------------------------------------------------------------------- +*/ static float32 float32_squash_input_denormal(float32 a STATUS_PARAM) { if (STATUS(flush_inputs_to_zero)) { @@ -251,13 +271,14 @@ static float32 float32_squash_input_denormal(float32 a STATUS_PARAM) return a; } -/*---------------------------------------------------------------------------- -| Normalizes the subnormal single-precision floating-point value represented -| by the denormalized significand `aSig'. The normalized exponent and -| significand are stored at the locations pointed to by `zExpPtr' and -| `zSigPtr', respectively. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Normalizes the subnormal single-precision floating-point value represented +by the denormalized significand `aSig'. The normalized exponent and +significand are stored at the locations pointed to by `zExpPtr' and +`zSigPtr', respectively. 
+------------------------------------------------------------------------------- +*/ static void normalizeFloat32Subnormal(uint32_t aSig, int_fast16_t *zExpPtr, uint32_t *zSigPtr) { @@ -269,16 +290,18 @@ static void } -/*---------------------------------------------------------------------------- -| Packs the sign `zSign', exponent `zExp', and significand `zSig' into a -| single-precision floating-point value, returning the result. After being -| shifted into the proper positions, the three fields are simply added -| together to form the result. This means that any integer portion of `zSig' -| will be added into the exponent. Since a properly normalized significand -| will have an integer portion equal to 1, the `zExp' input should be 1 less -| than the desired result exponent whenever `zSig' is a complete, normalized -| significand. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Packs the sign `zSign', exponent `zExp', and significand `zSig' into a +single-precision floating-point value, returning the result. After being +shifted into the proper positions, the three fields are simply added +together to form the result. This means that any integer portion of `zSig' +will be added into the exponent. Since a properly normalized significand +will have an integer portion equal to 1, the `zExp' input should be 1 less +than the desired result exponent whenever `zSig' is a complete, normalized +significand. 
+------------------------------------------------------------------------------- +*/ INLINE float32 packFloat32(flag zSign, int_fast16_t zExp, uint32_t zSig) { @@ -288,27 +311,29 @@ INLINE float32 packFloat32(flag zSign, int_fast16_t zExp, uint32_t zSig) } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and significand `zSig', and returns the proper single-precision floating- -| point value corresponding to the abstract input. Ordinarily, the abstract -| value is simply rounded and packed into the single-precision format, with -| the inexact exception raised if the abstract input cannot be represented -| exactly. However, if the abstract value is too large, the overflow and -| inexact exceptions are raised and an infinity or maximal finite value is -| returned. If the abstract value is too small, the input value is rounded to -| a subnormal number, and the underflow and inexact exceptions are raised if -| the abstract input cannot be represented exactly as a subnormal single- -| precision floating-point number. -| The input significand `zSig' has its binary point between bits 30 -| and 29, which is 7 bits to the left of the usual location. This shifted -| significand must be normalized or smaller. If `zSig' is not normalized, -| `zExp' must be 0; in that case, the result returned is a subnormal number, -| and it must not require rounding. In the usual case that `zSig' is -| normalized, `zExp' must be 1 less than the ``true'' floating-point exponent. -| The handling of underflow and overflow follows the IEC/IEEE Standard for -| Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and significand `zSig', and returns the proper single-precision floating- +point value corresponding to the abstract input. Ordinarily, the abstract +value is simply rounded and packed into the single-precision format, with +the inexact exception raised if the abstract input cannot be represented +exactly. However, if the abstract value is too large, the overflow and +inexact exceptions are raised and an infinity or maximal finite value is +returned. If the abstract value is too small, the input value is rounded to +a subnormal number, and the underflow and inexact exceptions are raised if +the abstract input cannot be represented exactly as a subnormal single- +precision floating-point number. + The input significand `zSig' has its binary point between bits 30 +and 29, which is 7 bits to the left of the usual location. This shifted +significand must be normalized or smaller. If `zSig' is not normalized, +`zExp' must be 0; in that case, the result returned is a subnormal number, +and it must not require rounding. In the usual case that `zSig' is +normalized, `zExp' must be 1 less than the ``true'' floating-point exponent. +The handling of underflow and overflow follows the IEC/IEEE Standard for +Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ static float32 roundAndPackFloat32(flag zSign, int_fast16_t zExp, uint32_t zSig STATUS_PARAM) { @@ -366,15 +391,16 @@ static float32 roundAndPackFloat32(flag zSign, int_fast16_t zExp, uint32_t zSig } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and significand `zSig', and returns the proper single-precision floating- -| point value corresponding to the abstract input. This routine is just like -| `roundAndPackFloat32' except that `zSig' does not have to be normalized. -| Bit 31 of `zSig' must be zero, and `zExp' must be 1 less than the ``true'' -| floating-point exponent. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and significand `zSig', and returns the proper single-precision floating- +point value corresponding to the abstract input. This routine is just like +`roundAndPackFloat32' except that `zSig' does not have to be normalized. +Bit 31 of `zSig' must be zero, and `zExp' must be 1 less than the ``true'' +floating-point exponent. +------------------------------------------------------------------------------- +*/ static float32 normalizeRoundAndPackFloat32(flag zSign, int_fast16_t zExp, uint32_t zSig STATUS_PARAM) { @@ -385,9 +411,11 @@ static float32 } -/*---------------------------------------------------------------------------- -| Returns the fraction bits of the double-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the fraction bits of the double-precision floating-point value `a'. 
+------------------------------------------------------------------------------- +*/ INLINE uint64_t extractFloat64Frac( float64 a ) { @@ -396,9 +424,11 @@ INLINE uint64_t extractFloat64Frac( float64 a ) } -/*---------------------------------------------------------------------------- -| Returns the exponent bits of the double-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the exponent bits of the double-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE int_fast16_t extractFloat64Exp(float64 a) { @@ -407,10 +437,11 @@ INLINE int_fast16_t extractFloat64Exp(float64 a) } -/*---------------------------------------------------------------------------- -| Returns the sign bit of the double-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the sign bit of the double-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE flag extractFloat64Sign( float64 a ) { @@ -418,10 +449,12 @@ INLINE flag extractFloat64Sign( float64 a ) } -/*---------------------------------------------------------------------------- -| If `a' is denormal and we are in flush-to-zero mode then set the -| input-denormal exception and return zero. Otherwise just return the value. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +If `a' is denormal and we are in flush-to-zero mode then set the +input-denormal exception and return zero. Otherwise just return the value. 
+------------------------------------------------------------------------------- +*/ static float64 float64_squash_input_denormal(float64 a STATUS_PARAM) { if (STATUS(flush_inputs_to_zero)) { @@ -433,13 +466,14 @@ static float64 float64_squash_input_denormal(float64 a STATUS_PARAM) return a; } -/*---------------------------------------------------------------------------- -| Normalizes the subnormal double-precision floating-point value represented -| by the denormalized significand `aSig'. The normalized exponent and -| significand are stored at the locations pointed to by `zExpPtr' and -| `zSigPtr', respectively. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Normalizes the subnormal double-precision floating-point value represented +by the denormalized significand `aSig'. The normalized exponent and +significand are stored at the locations pointed to by `zExpPtr' and +`zSigPtr', respectively. +------------------------------------------------------------------------------- +*/ static void normalizeFloat64Subnormal(uint64_t aSig, int_fast16_t *zExpPtr, uint64_t *zSigPtr) { @@ -451,16 +485,18 @@ static void } -/*---------------------------------------------------------------------------- -| Packs the sign `zSign', exponent `zExp', and significand `zSig' into a -| double-precision floating-point value, returning the result. After being -| shifted into the proper positions, the three fields are simply added -| together to form the result. This means that any integer portion of `zSig' -| will be added into the exponent. Since a properly normalized significand -| will have an integer portion equal to 1, the `zExp' input should be 1 less -| than the desired result exponent whenever `zSig' is a complete, normalized -| significand. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Packs the sign `zSign', exponent `zExp', and significand `zSig' into a +double-precision floating-point value, returning the result. After being +shifted into the proper positions, the three fields are simply added +together to form the result. This means that any integer portion of `zSig' +will be added into the exponent. Since a properly normalized significand +will have an integer portion equal to 1, the `zExp' input should be 1 less +than the desired result exponent whenever `zSig' is a complete, normalized +significand. +------------------------------------------------------------------------------- +*/ INLINE float64 packFloat64(flag zSign, int_fast16_t zExp, uint64_t zSig) { @@ -470,27 +506,29 @@ INLINE float64 packFloat64(flag zSign, int_fast16_t zExp, uint64_t zSig) } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and significand `zSig', and returns the proper double-precision floating- -| point value corresponding to the abstract input. Ordinarily, the abstract -| value is simply rounded and packed into the double-precision format, with -| the inexact exception raised if the abstract input cannot be represented -| exactly. However, if the abstract value is too large, the overflow and -| inexact exceptions are raised and an infinity or maximal finite value is -| returned. If the abstract value is too small, the input value is rounded -| to a subnormal number, and the underflow and inexact exceptions are raised -| if the abstract input cannot be represented exactly as a subnormal double- -| precision floating-point number. -| The input significand `zSig' has its binary point between bits 62 -| and 61, which is 10 bits to the left of the usual location. 
This shifted -| significand must be normalized or smaller. If `zSig' is not normalized, -| `zExp' must be 0; in that case, the result returned is a subnormal number, -| and it must not require rounding. In the usual case that `zSig' is -| normalized, `zExp' must be 1 less than the ``true'' floating-point exponent. -| The handling of underflow and overflow follows the IEC/IEEE Standard for -| Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and significand `zSig', and returns the proper double-precision floating- +point value corresponding to the abstract input. Ordinarily, the abstract +value is simply rounded and packed into the double-precision format, with +the inexact exception raised if the abstract input cannot be represented +exactly. However, if the abstract value is too large, the overflow and +inexact exceptions are raised and an infinity or maximal finite value is +returned. If the abstract value is too small, the input value is rounded +to a subnormal number, and the underflow and inexact exceptions are raised +if the abstract input cannot be represented exactly as a subnormal double- +precision floating-point number. + The input significand `zSig' has its binary point between bits 62 +and 61, which is 10 bits to the left of the usual location. This shifted +significand must be normalized or smaller. If `zSig' is not normalized, +`zExp' must be 0; in that case, the result returned is a subnormal number, +and it must not require rounding. In the usual case that `zSig' is +normalized, `zExp' must be 1 less than the ``true'' floating-point exponent. +The handling of underflow and overflow follows the IEC/IEEE Standard for +Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ static float64 roundAndPackFloat64(flag zSign, int_fast16_t zExp, uint64_t zSig STATUS_PARAM) { @@ -548,15 +586,16 @@ static float64 roundAndPackFloat64(flag zSign, int_fast16_t zExp, uint64_t zSig } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and significand `zSig', and returns the proper double-precision floating- -| point value corresponding to the abstract input. This routine is just like -| `roundAndPackFloat64' except that `zSig' does not have to be normalized. -| Bit 63 of `zSig' must be zero, and `zExp' must be 1 less than the ``true'' -| floating-point exponent. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and significand `zSig', and returns the proper double-precision floating- +point value corresponding to the abstract input. This routine is just like +`roundAndPackFloat64' except that `zSig' does not have to be normalized. +Bit 63 of `zSig' must be zero, and `zExp' must be 1 less than the ``true'' +floating-point exponent. +------------------------------------------------------------------------------- +*/ static float64 normalizeRoundAndPackFloat64(flag zSign, int_fast16_t zExp, uint64_t zSig STATUS_PARAM) { @@ -567,10 +606,12 @@ static float64 } -/*---------------------------------------------------------------------------- -| Returns the fraction bits of the extended double-precision floating-point -| value `a'. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the fraction bits of the extended double-precision floating-point +value `a'. +------------------------------------------------------------------------------- +*/ INLINE uint64_t extractFloatx80Frac( floatx80 a ) { @@ -579,11 +620,12 @@ INLINE uint64_t extractFloatx80Frac( floatx80 a ) } -/*---------------------------------------------------------------------------- -| Returns the exponent bits of the extended double-precision floating-point -| value `a'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the exponent bits of the extended double-precision floating-point +value `a'. +------------------------------------------------------------------------------- +*/ INLINE int32 extractFloatx80Exp( floatx80 a ) { @@ -591,11 +633,12 @@ INLINE int32 extractFloatx80Exp( floatx80 a ) } -/*---------------------------------------------------------------------------- -| Returns the sign bit of the extended double-precision floating-point value -| `a'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the sign bit of the extended double-precision floating-point value +`a'. +------------------------------------------------------------------------------- +*/ INLINE flag extractFloatx80Sign( floatx80 a ) { @@ -603,13 +646,14 @@ INLINE flag extractFloatx80Sign( floatx80 a ) } -/*---------------------------------------------------------------------------- -| Normalizes the subnormal extended double-precision floating-point value -| represented by the denormalized significand `aSig'. 
The normalized exponent -| and significand are stored at the locations pointed to by `zExpPtr' and -| `zSigPtr', respectively. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Normalizes the subnormal extended double-precision floating-point value +represented by the denormalized significand `aSig'. The normalized exponent +and significand are stored at the locations pointed to by `zExpPtr' and +`zSigPtr', respectively. +------------------------------------------------------------------------------- +*/ static void normalizeFloatx80Subnormal( uint64_t aSig, int32 *zExpPtr, uint64_t *zSigPtr ) { @@ -621,10 +665,12 @@ static void } -/*---------------------------------------------------------------------------- -| Packs the sign `zSign', exponent `zExp', and significand `zSig' into an -| extended double-precision floating-point value, returning the result. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Packs the sign `zSign', exponent `zExp', and significand `zSig' into an +extended double-precision floating-point value, returning the result. +------------------------------------------------------------------------------- +*/ INLINE floatx80 packFloatx80( flag zSign, int32 zExp, uint64_t zSig ) { @@ -636,30 +682,31 @@ INLINE floatx80 packFloatx80( flag zSign, int32 zExp, uint64_t zSig ) } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and extended significand formed by the concatenation of `zSig0' and `zSig1', -| and returns the proper extended double-precision floating-point value -| corresponding to the abstract input. 
Ordinarily, the abstract value is -| rounded and packed into the extended double-precision format, with the -| inexact exception raised if the abstract input cannot be represented -| exactly. However, if the abstract value is too large, the overflow and -| inexact exceptions are raised and an infinity or maximal finite value is -| returned. If the abstract value is too small, the input value is rounded to -| a subnormal number, and the underflow and inexact exceptions are raised if -| the abstract input cannot be represented exactly as a subnormal extended -| double-precision floating-point number. -| If `roundingPrecision' is 32 or 64, the result is rounded to the same -| number of bits as single or double precision, respectively. Otherwise, the -| result is rounded to the full precision of the extended double-precision -| format. -| The input significand must be normalized or smaller. If the input -| significand is not normalized, `zExp' must be 0; in that case, the result -| returned is a subnormal number, and it must not require rounding. The -| handling of underflow and overflow follows the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and extended significand formed by the concatenation of `zSig0' and `zSig1', +and returns the proper extended double-precision floating-point value +corresponding to the abstract input. Ordinarily, the abstract value is +rounded and packed into the extended double-precision format, with the +inexact exception raised if the abstract input cannot be represented +exactly. However, if the abstract value is too large, the overflow and +inexact exceptions are raised and an infinity or maximal finite value is +returned. 
If the abstract value is too small, the input value is rounded to +a subnormal number, and the underflow and inexact exceptions are raised if +the abstract input cannot be represented exactly as a subnormal extended +double-precision floating-point number. + If `roundingPrecision' is 32 or 64, the result is rounded to the same +number of bits as single or double precision, respectively. Otherwise, the +result is rounded to the full precision of the extended double-precision +format. + The input significand must be normalized or smaller. If the input +significand is not normalized, `zExp' must be 0; in that case, the result +returned is a subnormal number, and it must not require rounding. The +handling of underflow and overflow follows the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ static floatx80 roundAndPackFloatx80( int8 roundingPrecision, flag zSign, int32 zExp, uint64_t zSig0, uint64_t zSig1 @@ -823,15 +870,16 @@ static floatx80 } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent -| `zExp', and significand formed by the concatenation of `zSig0' and `zSig1', -| and returns the proper extended double-precision floating-point value -| corresponding to the abstract input. This routine is just like -| `roundAndPackFloatx80' except that the input significand does not have to be -| normalized. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent +`zExp', and significand formed by the concatenation of `zSig0' and `zSig1', +and returns the proper extended double-precision floating-point value +corresponding to the abstract input. 
This routine is just like +`roundAndPackFloatx80' except that the input significand does not have to be +normalized. +------------------------------------------------------------------------------- +*/ static floatx80 normalizeRoundAndPackFloatx80( int8 roundingPrecision, flag zSign, int32 zExp, uint64_t zSig0, uint64_t zSig1 @@ -852,10 +900,12 @@ static floatx80 } -/*---------------------------------------------------------------------------- -| Returns the least-significant 64 fraction bits of the quadruple-precision -| floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the least-significant 64 fraction bits of the quadruple-precision +floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE uint64_t extractFloat128Frac1( float128 a ) { @@ -864,10 +914,12 @@ INLINE uint64_t extractFloat128Frac1( float128 a ) } -/*---------------------------------------------------------------------------- -| Returns the most-significant 48 fraction bits of the quadruple-precision -| floating-point value `a'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the most-significant 48 fraction bits of the quadruple-precision +floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE uint64_t extractFloat128Frac0( float128 a ) { @@ -876,11 +928,12 @@ INLINE uint64_t extractFloat128Frac0( float128 a ) } -/*---------------------------------------------------------------------------- -| Returns the exponent bits of the quadruple-precision floating-point value -| `a'. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the exponent bits of the quadruple-precision floating-point value +`a'. +------------------------------------------------------------------------------- +*/ INLINE int32 extractFloat128Exp( float128 a ) { @@ -888,10 +941,11 @@ INLINE int32 extractFloat128Exp( float128 a ) } -/*---------------------------------------------------------------------------- -| Returns the sign bit of the quadruple-precision floating-point value `a'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the sign bit of the quadruple-precision floating-point value `a'. +------------------------------------------------------------------------------- +*/ INLINE flag extractFloat128Sign( float128 a ) { @@ -899,16 +953,17 @@ INLINE flag extractFloat128Sign( float128 a ) } -/*---------------------------------------------------------------------------- -| Normalizes the subnormal quadruple-precision floating-point value -| represented by the denormalized significand formed by the concatenation of -| `aSig0' and `aSig1'. The normalized exponent is stored at the location -| pointed to by `zExpPtr'. The most significant 49 bits of the normalized -| significand are stored at the location pointed to by `zSig0Ptr', and the -| least significant 64 bits of the normalized significand are stored at the -| location pointed to by `zSig1Ptr'. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Normalizes the subnormal quadruple-precision floating-point value +represented by the denormalized significand formed by the concatenation of +`aSig0' and `aSig1'. 
The normalized exponent is stored at the location +pointed to by `zExpPtr'. The most significant 49 bits of the normalized +significand are stored at the location pointed to by `zSig0Ptr', and the +least significant 64 bits of the normalized significand are stored at the +location pointed to by `zSig1Ptr'. +------------------------------------------------------------------------------- +*/ static void normalizeFloat128Subnormal( uint64_t aSig0, @@ -940,19 +995,20 @@ static void } -/*---------------------------------------------------------------------------- -| Packs the sign `zSign', the exponent `zExp', and the significand formed -| by the concatenation of `zSig0' and `zSig1' into a quadruple-precision -| floating-point value, returning the result. After being shifted into the -| proper positions, the three fields `zSign', `zExp', and `zSig0' are simply -| added together to form the most significant 32 bits of the result. This -| means that any integer portion of `zSig0' will be added into the exponent. -| Since a properly normalized significand will have an integer portion equal -| to 1, the `zExp' input should be 1 less than the desired result exponent -| whenever `zSig0' and `zSig1' concatenated form a complete, normalized -| significand. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Packs the sign `zSign', the exponent `zExp', and the significand formed +by the concatenation of `zSig0' and `zSig1' into a quadruple-precision +floating-point value, returning the result. After being shifted into the +proper positions, the three fields `zSign', `zExp', and `zSig0' are simply +added together to form the most significant 32 bits of the result. This +means that any integer portion of `zSig0' will be added into the exponent. 
+Since a properly normalized significand will have an integer portion equal +to 1, the `zExp' input should be 1 less than the desired result exponent +whenever `zSig0' and `zSig1' concatenated form a complete, normalized +significand. +------------------------------------------------------------------------------- +*/ INLINE float128 packFloat128( flag zSign, int32 zExp, uint64_t zSig0, uint64_t zSig1 ) { @@ -964,27 +1020,28 @@ INLINE float128 } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and extended significand formed by the concatenation of `zSig0', `zSig1', -| and `zSig2', and returns the proper quadruple-precision floating-point value -| corresponding to the abstract input. Ordinarily, the abstract value is -| simply rounded and packed into the quadruple-precision format, with the -| inexact exception raised if the abstract input cannot be represented -| exactly. However, if the abstract value is too large, the overflow and -| inexact exceptions are raised and an infinity or maximal finite value is -| returned. If the abstract value is too small, the input value is rounded to -| a subnormal number, and the underflow and inexact exceptions are raised if -| the abstract input cannot be represented exactly as a subnormal quadruple- -| precision floating-point number. -| The input significand must be normalized or smaller. If the input -| significand is not normalized, `zExp' must be 0; in that case, the result -| returned is a subnormal number, and it must not require rounding. In the -| usual case that the input significand is normalized, `zExp' must be 1 less -| than the ``true'' floating-point exponent. The handling of underflow and -| overflow follows the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and extended significand formed by the concatenation of `zSig0', `zSig1', +and `zSig2', and returns the proper quadruple-precision floating-point value +corresponding to the abstract input. Ordinarily, the abstract value is +simply rounded and packed into the quadruple-precision format, with the +inexact exception raised if the abstract input cannot be represented +exactly. However, if the abstract value is too large, the overflow and +inexact exceptions are raised and an infinity or maximal finite value is +returned. If the abstract value is too small, the input value is rounded to +a subnormal number, and the underflow and inexact exceptions are raised if +the abstract input cannot be represented exactly as a subnormal quadruple- +precision floating-point number. + The input significand must be normalized or smaller. If the input +significand is not normalized, `zExp' must be 0; in that case, the result +returned is a subnormal number, and it must not require rounding. In the +usual case that the input significand is normalized, `zExp' must be 1 less +than the ``true'' floating-point exponent. The handling of underflow and +overflow follows the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ static float128 roundAndPackFloat128( flag zSign, int32 zExp, uint64_t zSig0, uint64_t zSig1, uint64_t zSig2 STATUS_PARAM) @@ -1079,16 +1136,17 @@ static float128 } -/*---------------------------------------------------------------------------- -| Takes an abstract floating-point value having sign `zSign', exponent `zExp', -| and significand formed by the concatenation of `zSig0' and `zSig1', and -| returns the proper quadruple-precision floating-point value corresponding -| to the abstract input. This routine is just like `roundAndPackFloat128' -| except that the input significand has fewer bits and does not have to be -| normalized. In all cases, `zExp' must be 1 less than the ``true'' floating- -| point exponent. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Takes an abstract floating-point value having sign `zSign', exponent `zExp', +and significand formed by the concatenation of `zSig0' and `zSig1', and +returns the proper quadruple-precision floating-point value corresponding +to the abstract input. This routine is just like `roundAndPackFloat128' +except that the input significand has fewer bits and does not have to be +normalized. In all cases, `zExp' must be 1 less than the ``true'' floating- +point exponent. +------------------------------------------------------------------------------- +*/ static float128 normalizeRoundAndPackFloat128( flag zSign, int32 zExp, uint64_t zSig0, uint64_t zSig1 STATUS_PARAM) @@ -1115,13 +1173,14 @@ static float128 } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 32-bit two's complement integer `a' -| to the single-precision floating-point format. The conversion is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - -float32 int32_to_float32( int32 a STATUS_PARAM ) +/* +------------------------------------------------------------------------------- +Returns the result of converting the 32-bit two's complement integer `a' +to the single-precision floating-point format. The conversion is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ +float32 int32_to_float32( int32 a STATUS_PARAM) { flag zSign; @@ -1132,13 +1191,14 @@ float32 int32_to_float32( int32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 32-bit two's complement integer `a' -| to the double-precision floating-point format. The conversion is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - -float64 int32_to_float64( int32 a STATUS_PARAM ) +/* +------------------------------------------------------------------------------- +Returns the result of converting the 32-bit two's complement integer `a' +to the double-precision floating-point format. The conversion is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ +float64 int32_to_float64( int32 a STATUS_PARAM) { flag zSign; uint32 absA; @@ -1154,13 +1214,14 @@ float64 int32_to_float64( int32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 32-bit two's complement integer `a' -| to the extended double-precision floating-point format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the 32-bit two's complement integer `a' +to the extended double-precision floating-point format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 int32_to_floatx80( int32 a STATUS_PARAM ) { flag zSign; @@ -1177,12 +1238,13 @@ floatx80 int32_to_floatx80( int32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 32-bit two's complement integer `a' to -| the quadruple-precision floating-point format. The conversion is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the 32-bit two's complement integer `a' to +the quadruple-precision floating-point format. The conversion is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 int32_to_float128( int32 a STATUS_PARAM ) { flag zSign; @@ -1199,12 +1261,13 @@ float128 int32_to_float128( int32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 64-bit two's complement integer `a' -| to the single-precision floating-point format. The conversion is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the 64-bit two's complement integer `a' +to the single-precision floating-point format. The conversion is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float32 int64_to_float32( int64 a STATUS_PARAM ) { flag zSign; @@ -1252,12 +1315,13 @@ float32 uint64_to_float32( uint64 a STATUS_PARAM ) } } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 64-bit two's complement integer `a' -| to the double-precision floating-point format. The conversion is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the 64-bit two's complement integer `a' +to the double-precision floating-point format. The conversion is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 int64_to_float64( int64 a STATUS_PARAM ) { flag zSign; @@ -1285,13 +1349,14 @@ float64 uint64_to_float64(uint64 a STATUS_PARAM) return normalizeRoundAndPackFloat64(0, exp, a STATUS_VAR); } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 64-bit two's complement integer `a' -| to the extended double-precision floating-point format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the 64-bit two's complement integer `a' +to the extended double-precision floating-point format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 int64_to_floatx80( int64 a STATUS_PARAM ) { flag zSign; @@ -1306,12 +1371,13 @@ floatx80 int64_to_floatx80( int64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the 64-bit two's complement integer `a' to -| the quadruple-precision floating-point format. The conversion is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the 64-bit two's complement integer `a' to +the quadruple-precision floating-point format. The conversion is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 int64_to_float128( int64 a STATUS_PARAM ) { flag zSign; @@ -1347,16 +1413,17 @@ float128 uint64_to_float128(uint64 a STATUS_PARAM) return normalizeRoundAndPackFloat128(0, 0x406E, a, 0 STATUS_VAR); } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the 32-bit two's complement integer format. 
The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic---which means in particular that the conversion is rounded -| according to the current rounding mode. If `a' is a NaN, the largest -| positive integer is returned. Otherwise, if the conversion overflows, the -| largest integer with the same sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the 32-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic---which means in particular that the conversion is rounded +according to the current rounding mode. If `a' is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int32 float32_to_int32( float32 a STATUS_PARAM ) { flag aSign; @@ -1378,16 +1445,17 @@ int32 float32_to_int32( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the 32-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. -| If `a' is a NaN, the largest positive integer is returned. Otherwise, if -| the conversion overflows, the largest integer with the same sign as `a' is -| returned. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the 32-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. +If `a' is a NaN, the largest positive integer is returned. Otherwise, if +the conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int32 float32_to_int32_round_to_zero( float32 a STATUS_PARAM ) { flag aSign; @@ -1421,15 +1489,17 @@ int32 float32_to_int32_round_to_zero( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the 16-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. -| If `a' is a NaN, the largest positive integer is returned. Otherwise, if -| the conversion overflows, the largest integer with the same sign as `a' is -| returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the 16-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. +If `a' is a NaN, the largest positive integer is returned. 
Otherwise, if +the conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int_fast16_t float32_to_int16_round_to_zero(float32 a STATUS_PARAM) { @@ -1470,16 +1540,17 @@ int_fast16_t float32_to_int16_round_to_zero(float32 a STATUS_PARAM) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the 64-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic---which means in particular that the conversion is rounded -| according to the current rounding mode. If `a' is a NaN, the largest -| positive integer is returned. Otherwise, if the conversion overflows, the -| largest integer with the same sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the 64-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic---which means in particular that the conversion is rounded +according to the current rounding mode. If `a' is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as `a' is returned. 
+------------------------------------------------------------------------------- +*/ int64 float32_to_int64( float32 a STATUS_PARAM ) { flag aSign; @@ -1507,16 +1578,17 @@ int64 float32_to_int64( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the 64-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. If -| `a' is a NaN, the largest positive integer is returned. Otherwise, if the -| conversion overflows, the largest integer with the same sign as `a' is -| returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the 64-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. If +`a' is a NaN, the largest positive integer is returned. Otherwise, if the +conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int64 float32_to_int64_round_to_zero( float32 a STATUS_PARAM ) { flag aSign; @@ -1554,13 +1626,14 @@ int64 float32_to_int64_round_to_zero( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the double-precision floating-point format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the double-precision floating-point format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float32_to_float64( float32 a STATUS_PARAM ) { flag aSign; @@ -1584,13 +1657,14 @@ float64 float32_to_float64( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the extended double-precision floating-point format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the extended double-precision floating-point format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 float32_to_floatx80( float32 a STATUS_PARAM ) { flag aSign; @@ -1614,13 +1688,14 @@ floatx80 float32_to_floatx80( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the single-precision floating-point value -| `a' to the double-precision floating-point format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the single-precision floating-point value +`a' to the double-precision floating-point format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ float128 float32_to_float128( float32 a STATUS_PARAM ) { flag aSign; @@ -1644,14 +1719,15 @@ float128 float32_to_float128( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Rounds the single-precision floating-point value `a' to an integer, and -| returns the result as a single-precision floating-point value. The -| operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - -float32 float32_round_to_int( float32 a STATUS_PARAM) +/* +------------------------------------------------------------------------------- +Rounds the single-precision floating-point value `a' to an integer, and +returns the result as a single-precision floating-point value. The +operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ +float32 float32_round_to_int( float32 a STATUS_PARAM ) { flag aSign; int_fast16_t aExp; @@ -1704,15 +1780,16 @@ float32 float32_round_to_int( float32 a STATUS_PARAM) } -/*---------------------------------------------------------------------------- -| Returns the result of adding the absolute values of the single-precision -| floating-point values `a' and `b'. If `zSign' is 1, the sum is negated -| before being returned. `zSign' is ignored if the result is a NaN. 
-| The addition is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - -static float32 addFloat32Sigs( float32 a, float32 b, flag zSign STATUS_PARAM) +/* +------------------------------------------------------------------------------- +Returns the result of adding the absolute values of the single-precision +floating-point values `a' and `b'. If `zSign' is 1, the sum is negated +before being returned. `zSign' is ignored if the result is a NaN. +The addition is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ +static float32 addFloat32Sigs( float32 a, float32 b, flag zSign STATUS_PARAM ) { int_fast16_t aExp, bExp, zExp; uint32_t aSig, bSig, zSig; @@ -1783,15 +1860,16 @@ static float32 addFloat32Sigs( float32 a, float32 b, flag zSign STATUS_PARAM) } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the absolute values of the single- -| precision floating-point values `a' and `b'. If `zSign' is 1, the -| difference is negated before being returned. `zSign' is ignored if the -| result is a NaN. The subtraction is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - -static float32 subFloat32Sigs( float32 a, float32 b, flag zSign STATUS_PARAM) +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the absolute values of the single- +precision floating-point values `a' and `b'. If `zSign' is 1, the +difference is negated before being returned. `zSign' is ignored if the +result is a NaN. The subtraction is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ +static float32 subFloat32Sigs( float32 a, float32 b, flag zSign STATUS_PARAM ) { int_fast16_t aExp, bExp, zExp; uint32_t aSig, bSig, zSig; @@ -1858,12 +1936,13 @@ static float32 subFloat32Sigs( float32 a, float32 b, flag zSign STATUS_PARAM) } -/*---------------------------------------------------------------------------- -| Returns the result of adding the single-precision floating-point values `a' -| and `b'. The operation is performed according to the IEC/IEEE Standard for -| Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the single-precision floating-point values `a' +and `b'. The operation is performed according to the IEC/IEEE Standard for +Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float32 float32_add( float32 a, float32 b STATUS_PARAM ) { flag aSign, bSign; @@ -1881,12 +1960,13 @@ float32 float32_add( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the single-precision floating-point values -| `a' and `b'. The operation is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the single-precision floating-point values +`a' and `b'. The operation is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float32 float32_sub( float32 a, float32 b STATUS_PARAM ) { flag aSign, bSign; @@ -1904,12 +1984,13 @@ float32 float32_sub( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of multiplying the single-precision floating-point values -| `a' and `b'. The operation is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of multiplying the single-precision floating-point values +`a' and `b'. The operation is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float32 float32_mul( float32 a, float32 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -1967,12 +2048,13 @@ float32 float32_mul( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of dividing the single-precision floating-point value `a' -| by the corresponding value `b'. The operation is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of dividing the single-precision floating-point value `a' +by the corresponding value `b'. The operation is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float32 float32_div( float32 a, float32 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -2031,12 +2113,13 @@ float32 float32_div( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the remainder of the single-precision floating-point value `a' -| with respect to the corresponding value `b'. The operation is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the remainder of the single-precision floating-point value `a' +with respect to the corresponding value `b'. The operation is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float32 float32_rem( float32 a, float32 b STATUS_PARAM ) { flag aSign, zSign; @@ -2132,16 +2215,18 @@ float32 float32_rem( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of multiplying the single-precision floating-point values -| `a' and `b' then adding 'c', with no intermediate rounding step after the -| multiplication. The operation is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic 754-2008. -| The flags argument allows the caller to select negation of the -| addend, the intermediate product, or the final result. (The difference -| between this and having the caller do a separate negation is that negating -| externally will flip the sign bit on NaNs.) 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of multiplying the single-precision floating-point values +`a' and `b' then adding 'c', with no intermediate rounding step after the +multiplication. The operation is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic 754-2008. +The flags argument allows the caller to select negation of the +addend, the intermediate product, or the final result. (The difference +between this and having the caller do a separate negation is that negating +externally will flip the sign bit on NaNs.) +------------------------------------------------------------------------------- +*/ float32 float32_muladd(float32 a, float32 b, float32 c, int flags STATUS_PARAM) { @@ -2339,12 +2424,13 @@ float32 float32_muladd(float32 a, float32 b, float32 c, int flags STATUS_PARAM) } -/*---------------------------------------------------------------------------- -| Returns the square root of the single-precision floating-point value `a'. -| The operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the square root of the single-precision floating-point value `a'. +The operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float32 float32_sqrt( float32 a STATUS_PARAM ) { flag aSign; @@ -2394,23 +2480,25 @@ float32 float32_sqrt( float32 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the binary exponential of the single-precision floating-point value -| `a'. 
The operation is performed according to the IEC/IEEE Standard for -| Binary Floating-Point Arithmetic. -| -| Uses the following identities: -| -| 1. ------------------------------------------------------------------------- -| x x*ln(2) -| 2 = e -| -| 2. ------------------------------------------------------------------------- -| 2 3 4 5 n -| x x x x x x x -| e = 1 + --- + --- + --- + --- + --- + ... + --- + ... -| 1! 2! 3! 4! 5! n! -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the binary exponential of the single-precision floating-point value +`a'. The operation is performed according to the IEC/IEEE Standard for +Binary Floating-Point Arithmetic. + +Uses the following identities: + +1. ------------------------------------------------------------------------- + x x*ln(2) + 2 = e + +2. ------------------------------------------------------------------------- + 2 3 4 5 n + x x x x x x x + e = 1 + --- + --- + --- + --- + --- + ... + --- + ... + 1! 2! 3! 4! 5! n! +------------------------------------------------------------------------------- +*/ static const float64 float32_exp2_coefficients[15] = { @@ -2474,11 +2562,13 @@ float32 float32_exp2( float32 a STATUS_PARAM ) return float64_to_float32(r, status); } -/*---------------------------------------------------------------------------- -| Returns the binary log of the single-precision floating-point value `a'. -| The operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the binary log of the single-precision floating-point value `a'. +The operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float32 float32_log2( float32 a STATUS_PARAM ) { flag aSign, zSign; @@ -2522,12 +2612,14 @@ float32 float32_log2( float32 a STATUS_PARAM ) return normalizeRoundAndPackFloat32( zSign, 0x85, zSig STATUS_VAR ); } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is equal to -| the corresponding value `b', and 0 otherwise. The invalid exception is -| raised if either operand is a NaN. Otherwise, the comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is equal to +the corresponding value `b', and 0 otherwise. The invalid exception is +raised if either operand is a NaN. Otherwise, the comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_eq( float32 a, float32 b STATUS_PARAM ) { @@ -2546,12 +2638,14 @@ int float32_eq( float32 a, float32 b STATUS_PARAM ) return ( av == bv ) || ( (uint32_t) ( ( av | bv )<<1 ) == 0 ); } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is less than -| or equal to the corresponding value `b', and 0 otherwise. The invalid -| exception is raised if either operand is a NaN. The comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is less than +or equal to the corresponding value `b', and 0 otherwise. The invalid +exception is raised if either operand is a NaN. The comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_le( float32 a, float32 b STATUS_PARAM ) { @@ -2575,12 +2669,14 @@ int float32_le( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is less than -| the corresponding value `b', and 0 otherwise. The invalid exception is -| raised if either operand is a NaN. The comparison is performed according -| to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is less than +the corresponding value `b', and 0 otherwise. The invalid exception is +raised if either operand is a NaN. The comparison is performed according +to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_lt( float32 a, float32 b STATUS_PARAM ) { @@ -2604,12 +2700,14 @@ int float32_lt( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point values `a' and `b' cannot -| be compared, and 0 otherwise. The invalid exception is raised if either -| operand is a NaN. 
The comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point values `a' and `b' cannot +be compared, and 0 otherwise. The invalid exception is raised if either +operand is a NaN. The comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_unordered( float32 a, float32 b STATUS_PARAM ) { @@ -2625,12 +2723,14 @@ int float32_unordered( float32 a, float32 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is equal to -| the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an -| exception. The comparison is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is equal to +the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an +exception. The comparison is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ int float32_eq_quiet( float32 a, float32 b STATUS_PARAM ) { @@ -2649,12 +2749,14 @@ int float32_eq_quiet( float32 a, float32 b STATUS_PARAM ) ( (uint32_t) ( ( float32_val(a) | float32_val(b) )<<1 ) == 0 ); } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is less than or -| equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not -| cause an exception. Otherwise, the comparison is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is less than or +equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not +cause an exception. Otherwise, the comparison is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_le_quiet( float32 a, float32 b STATUS_PARAM ) { @@ -2680,12 +2782,14 @@ int float32_le_quiet( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point value `a' is less than -| the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an -| exception. Otherwise, the comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point value `a' is less than +the corresponding value `b', and 0 otherwise. 
Quiet NaNs do not cause an +exception. Otherwise, the comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_lt_quiet( float32 a, float32 b STATUS_PARAM ) { @@ -2711,12 +2815,14 @@ int float32_lt_quiet( float32 a, float32 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the single-precision floating-point values `a' and `b' cannot -| be compared, and 0 otherwise. Quiet NaNs do not cause an exception. The -| comparison is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the single-precision floating-point values `a' and `b' cannot +be compared, and 0 otherwise. Quiet NaNs do not cause an exception. The +comparison is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float32_unordered_quiet( float32 a, float32 b STATUS_PARAM ) { @@ -2734,16 +2840,17 @@ int float32_unordered_quiet( float32 a, float32 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the 32-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic---which means in particular that the conversion is rounded -| according to the current rounding mode. If `a' is a NaN, the largest -| positive integer is returned. Otherwise, if the conversion overflows, the -| largest integer with the same sign as `a' is returned. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the 32-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic---which means in particular that the conversion is rounded +according to the current rounding mode. If `a' is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int32 float64_to_int32( float64 a STATUS_PARAM ) { flag aSign; @@ -2762,16 +2869,17 @@ int32 float64_to_int32( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the 32-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. -| If `a' is a NaN, the largest positive integer is returned. Otherwise, if -| the conversion overflows, the largest integer with the same sign as `a' is -| returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the 32-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. +If `a' is a NaN, the largest positive integer is returned. 
Otherwise, if +the conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int32 float64_to_int32_round_to_zero( float64 a STATUS_PARAM ) { flag aSign; @@ -2809,15 +2917,17 @@ int32 float64_to_int32_round_to_zero( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the 16-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. -| If `a' is a NaN, the largest positive integer is returned. Otherwise, if -| the conversion overflows, the largest integer with the same sign as `a' is -| returned. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the 16-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. +If `a' is a NaN, the largest positive integer is returned. Otherwise, if +the conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int_fast16_t float64_to_int16_round_to_zero(float64 a STATUS_PARAM) { @@ -2860,16 +2970,17 @@ int_fast16_t float64_to_int16_round_to_zero(float64 a STATUS_PARAM) return z; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the 64-bit two's complement integer format. 
The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic---which means in particular that the conversion is rounded -| according to the current rounding mode. If `a' is a NaN, the largest -| positive integer is returned. Otherwise, if the conversion overflows, the -| largest integer with the same sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the 64-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic---which means in particular that the conversion is rounded +according to the current rounding mode. If `a' is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int64 float64_to_int64( float64 a STATUS_PARAM ) { flag aSign; @@ -2903,16 +3014,17 @@ int64 float64_to_int64( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the 64-bit two's complement integer format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. -| If `a' is a NaN, the largest positive integer is returned. Otherwise, if -| the conversion overflows, the largest integer with the same sign as `a' is -| returned. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the 64-bit two's complement integer format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. +If `a' is a NaN, the largest positive integer is returned. Otherwise, if +the conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int64 float64_to_int64_round_to_zero( float64 a STATUS_PARAM ) { flag aSign; @@ -2956,13 +3068,14 @@ int64 float64_to_int64_round_to_zero( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the single-precision floating-point format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the single-precision floating-point format. The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. 
+------------------------------------------------------------------------------- +*/ float32 float64_to_float32( float64 a STATUS_PARAM ) { flag aSign; @@ -2989,16 +3102,18 @@ float32 float64_to_float32( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Packs the sign `zSign', exponent `zExp', and significand `zSig' into a -| half-precision floating-point value, returning the result. After being -| shifted into the proper positions, the three fields are simply added -| together to form the result. This means that any integer portion of `zSig' -| will be added into the exponent. Since a properly normalized significand -| will have an integer portion equal to 1, the `zExp' input should be 1 less -| than the desired result exponent whenever `zSig' is a complete, normalized -| significand. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Packs the sign `zSign', exponent `zExp', and significand `zSig' into a +half-precision floating-point value, returning the result. After being +shifted into the proper positions, the three fields are simply added +together to form the result. This means that any integer portion of `zSig' +will be added into the exponent. Since a properly normalized significand +will have an integer portion equal to 1, the `zExp' input should be 1 less +than the desired result exponent whenever `zSig' is a complete, normalized +significand. 
+------------------------------------------------------------------------------- +*/ static float16 packFloat16(flag zSign, int_fast16_t zExp, uint16_t zSig) { return make_float16( @@ -3132,13 +3247,14 @@ float16 float32_to_float16(float32 a, flag ieee STATUS_PARAM) return packFloat16(aSign, aExp + 14, aSig >> 13); } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the extended double-precision floating-point format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the extended double-precision floating-point format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 float64_to_floatx80( float64 a STATUS_PARAM ) { flag aSign; @@ -3163,13 +3279,14 @@ floatx80 float64_to_floatx80( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the double-precision floating-point value -| `a' to the quadruple-precision floating-point format. The conversion is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the double-precision floating-point value +`a' to the quadruple-precision floating-point format. 
The conversion is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ float128 float64_to_float128( float64 a STATUS_PARAM ) { flag aSign; @@ -3194,13 +3311,14 @@ float128 float64_to_float128( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Rounds the double-precision floating-point value `a' to an integer, and -| returns the result as a double-precision floating-point value. The -| operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Rounds the double-precision floating-point value `a' to an integer, and +returns the result as a double-precision floating-point value. The +operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_round_to_int( float64 a STATUS_PARAM ) { flag aSign; @@ -3267,14 +3385,15 @@ float64 float64_trunc_to_int( float64 a STATUS_PARAM) return res; } -/*---------------------------------------------------------------------------- -| Returns the result of adding the absolute values of the double-precision -| floating-point values `a' and `b'. If `zSign' is 1, the sum is negated -| before being returned. `zSign' is ignored if the result is a NaN. -| The addition is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the absolute values of the double-precision +floating-point values `a' and `b'. 
If `zSign' is 1, the sum is negated +before being returned. `zSign' is ignored if the result is a NaN. +The addition is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ static float64 addFloat64Sigs( float64 a, float64 b, flag zSign STATUS_PARAM ) { int_fast16_t aExp, bExp, zExp; @@ -3346,14 +3465,15 @@ static float64 addFloat64Sigs( float64 a, float64 b, flag zSign STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the absolute values of the double- -| precision floating-point values `a' and `b'. If `zSign' is 1, the -| difference is negated before being returned. `zSign' is ignored if the -| result is a NaN. The subtraction is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the absolute values of the double- +precision floating-point values `a' and `b'. If `zSign' is 1, the +difference is negated before being returned. `zSign' is ignored if the +result is a NaN. The subtraction is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ static float64 subFloat64Sigs( float64 a, float64 b, flag zSign STATUS_PARAM ) { int_fast16_t aExp, bExp, zExp; @@ -3421,12 +3541,13 @@ static float64 subFloat64Sigs( float64 a, float64 b, flag zSign STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of adding the double-precision floating-point values `a' -| and `b'. The operation is performed according to the IEC/IEEE Standard for -| Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the double-precision floating-point values `a' +and `b'. The operation is performed according to the IEC/IEEE Standard for +Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_add( float64 a, float64 b STATUS_PARAM ) { flag aSign, bSign; @@ -3444,12 +3565,13 @@ float64 float64_add( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the double-precision floating-point values -| `a' and `b'. The operation is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the double-precision floating-point values +`a' and `b'. The operation is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_sub( float64 a, float64 b STATUS_PARAM ) { flag aSign, bSign; @@ -3467,12 +3589,13 @@ float64 float64_sub( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of multiplying the double-precision floating-point values -| `a' and `b'. The operation is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of multiplying the double-precision floating-point values +`a' and `b'. The operation is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_mul( float64 a, float64 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -3528,12 +3651,13 @@ float64 float64_mul( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of dividing the double-precision floating-point value `a' -| by the corresponding value `b'. The operation is performed according to -| the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of dividing the double-precision floating-point value `a' +by the corresponding value `b'. The operation is performed according to +the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_div( float64 a, float64 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -3600,12 +3724,13 @@ float64 float64_div( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the remainder of the double-precision floating-point value `a' -| with respect to the corresponding value `b'. The operation is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the remainder of the double-precision floating-point value `a' +with respect to the corresponding value `b'. The operation is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_rem( float64 a, float64 b STATUS_PARAM ) { flag aSign, zSign; @@ -3686,16 +3811,18 @@ float64 float64_rem( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of multiplying the double-precision floating-point values -| `a' and `b' then adding 'c', with no intermediate rounding step after the -| multiplication. The operation is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic 754-2008. -| The flags argument allows the caller to select negation of the -| addend, the intermediate product, or the final result. (The difference -| between this and having the caller do a separate negation is that negating -| externally will flip the sign bit on NaNs.) -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the result of multiplying the double-precision floating-point values +`a' and `b' then adding 'c', with no intermediate rounding step after the +multiplication. The operation is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic 754-2008. +The flags argument allows the caller to select negation of the +addend, the intermediate product, or the final result. (The difference +between this and having the caller do a separate negation is that negating +externally will flip the sign bit on NaNs.) 
+------------------------------------------------------------------------------- +*/ float64 float64_muladd(float64 a, float64 b, float64 c, int flags STATUS_PARAM) { @@ -3912,12 +4039,13 @@ float64 float64_muladd(float64 a, float64 b, float64 c, int flags STATUS_PARAM) } } -/*---------------------------------------------------------------------------- -| Returns the square root of the double-precision floating-point value `a'. -| The operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the square root of the double-precision floating-point value `a'. +The operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float64_sqrt( float64 a STATUS_PARAM ) { flag aSign; @@ -3964,11 +4092,13 @@ float64 float64_sqrt( float64 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the binary log of the double-precision floating-point value `a'. -| The operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns the binary log of the double-precision floating-point value `a'. +The operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float64 float64_log2( float64 a STATUS_PARAM ) { flag aSign, zSign; @@ -4011,12 +4141,14 @@ float64 float64_log2( float64 a STATUS_PARAM ) return normalizeRoundAndPackFloat64( zSign, 0x408, zSig STATUS_VAR ); } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is equal to the -| corresponding value `b', and 0 otherwise. The invalid exception is raised -| if either operand is a NaN. Otherwise, the comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is equal to the +corresponding value `b', and 0 otherwise. The invalid exception is raised +if either operand is a NaN. Otherwise, the comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_eq( float64 a, float64 b STATUS_PARAM ) { @@ -4036,12 +4168,14 @@ int float64_eq( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is less than or -| equal to the corresponding value `b', and 0 otherwise. The invalid -| exception is raised if either operand is a NaN. The comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is less than or +equal to the corresponding value `b', and 0 otherwise. The invalid +exception is raised if either operand is a NaN. The comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_le( float64 a, float64 b STATUS_PARAM ) { @@ -4065,12 +4199,14 @@ int float64_le( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is less than -| the corresponding value `b', and 0 otherwise. The invalid exception is -| raised if either operand is a NaN. The comparison is performed according -| to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is less than +the corresponding value `b', and 0 otherwise. The invalid exception is +raised if either operand is a NaN. The comparison is performed according +to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_lt( float64 a, float64 b STATUS_PARAM ) { @@ -4094,12 +4230,14 @@ int float64_lt( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point values `a' and `b' cannot -| be compared, and 0 otherwise. The invalid exception is raised if either -| operand is a NaN. 
The comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point values `a' and `b' cannot +be compared, and 0 otherwise. The invalid exception is raised if either +operand is a NaN. The comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_unordered( float64 a, float64 b STATUS_PARAM ) { @@ -4115,12 +4253,14 @@ int float64_unordered( float64 a, float64 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is equal to the -| corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an -| exception.The comparison is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is equal to the +corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an +exception.The comparison is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_eq_quiet( float64 a, float64 b STATUS_PARAM ) { @@ -4142,12 +4282,14 @@ int float64_eq_quiet( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is less than or -| equal to the corresponding value `b', and 0 otherwise. 
Quiet NaNs do not -| cause an exception. Otherwise, the comparison is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is less than or +equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not +cause an exception. Otherwise, the comparison is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_le_quiet( float64 a, float64 b STATUS_PARAM ) { @@ -4173,12 +4315,14 @@ int float64_le_quiet( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point value `a' is less than -| the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an -| exception. Otherwise, the comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point value `a' is less than +the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an +exception. Otherwise, the comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ int float64_lt_quiet( float64 a, float64 b STATUS_PARAM ) { @@ -4204,12 +4348,14 @@ int float64_lt_quiet( float64 a, float64 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the double-precision floating-point values `a' and `b' cannot -| be compared, and 0 otherwise. Quiet NaNs do not cause an exception. The -| comparison is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the double-precision floating-point values `a' and `b' cannot +be compared, and 0 otherwise. Quiet NaNs do not cause an exception. The +comparison is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float64_unordered_quiet( float64 a, float64 b STATUS_PARAM ) { @@ -4227,16 +4373,17 @@ int float64_unordered_quiet( float64 a, float64 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the 32-bit two's complement integer format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic---which means in particular that the conversion -| is rounded according to the current rounding mode. If `a' is a NaN, the -| largest positive integer is returned. Otherwise, if the conversion -| overflows, the largest integer with the same sign as `a' is returned. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the 32-bit two's complement integer format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic---which means in particular that the conversion +is rounded according to the current rounding mode. If `a' is a NaN, the +largest positive integer is returned. Otherwise, if the conversion +overflows, the largest integer with the same sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int32 floatx80_to_int32( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4254,16 +4401,17 @@ int32 floatx80_to_int32( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the 32-bit two's complement integer format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic, except that the conversion is always rounded -| toward zero. If `a' is a NaN, the largest positive integer is returned. -| Otherwise, if the conversion overflows, the largest integer with the same -| sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the 32-bit two's complement integer format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic, except that the conversion is always rounded +toward zero. If `a' is a NaN, the largest positive integer is returned. 
+Otherwise, if the conversion overflows, the largest integer with the same +sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int32 floatx80_to_int32_round_to_zero( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4299,16 +4447,17 @@ int32 floatx80_to_int32_round_to_zero( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the 64-bit two's complement integer format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic---which means in particular that the conversion -| is rounded according to the current rounding mode. If `a' is a NaN, -| the largest positive integer is returned. Otherwise, if the conversion -| overflows, the largest integer with the same sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the 64-bit two's complement integer format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic---which means in particular that the conversion +is rounded according to the current rounding mode. If `a' is a NaN, +the largest positive integer is returned. Otherwise, if the conversion +overflows, the largest integer with the same sign as `a' is returned. 
+------------------------------------------------------------------------------- +*/ int64 floatx80_to_int64( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4339,16 +4488,17 @@ int64 floatx80_to_int64( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the 64-bit two's complement integer format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic, except that the conversion is always rounded -| toward zero. If `a' is a NaN, the largest positive integer is returned. -| Otherwise, if the conversion overflows, the largest integer with the same -| sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the 64-bit two's complement integer format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic, except that the conversion is always rounded +toward zero. If `a' is a NaN, the largest positive integer is returned. +Otherwise, if the conversion overflows, the largest integer with the same +sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int64 floatx80_to_int64_round_to_zero( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4383,13 +4533,14 @@ int64 floatx80_to_int64_round_to_zero( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the single-precision floating-point format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the single-precision floating-point format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float32 floatx80_to_float32( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4411,13 +4562,14 @@ float32 floatx80_to_float32( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the double-precision floating-point format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the double-precision floating-point format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float64 floatx80_to_float64( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4439,13 +4591,14 @@ float64 floatx80_to_float64( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the extended double-precision floating- -| point value `a' to the quadruple-precision floating-point format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the extended double-precision floating- +point value `a' to the quadruple-precision floating-point format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 floatx80_to_float128( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4463,13 +4616,14 @@ float128 floatx80_to_float128( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Rounds the extended double-precision floating-point value `a' to an integer, -| and returns the result as an extended quadruple-precision floating-point -| value. The operation is performed according to the IEC/IEEE Standard for -| Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Rounds the extended double-precision floating-point value `a' to an integer, +and returns the result as an extended quadruple-precision floating-point +value. The operation is performed according to the IEC/IEEE Standard for +Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 floatx80_round_to_int( floatx80 a STATUS_PARAM ) { flag aSign; @@ -4536,14 +4690,15 @@ floatx80 floatx80_round_to_int( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of adding the absolute values of the extended double- -| precision floating-point values `a' and `b'. If `zSign' is 1, the sum is -| negated before being returned. `zSign' is ignored if the result is a NaN. 
-| The addition is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the absolute values of the extended double- +precision floating-point values `a' and `b'. If `zSign' is 1, the sum is +negated before being returned. `zSign' is ignored if the result is a NaN. +The addition is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ static floatx80 addFloatx80Sigs( floatx80 a, floatx80 b, flag zSign STATUS_PARAM) { int32 aExp, bExp, zExp; @@ -4602,14 +4757,15 @@ static floatx80 addFloatx80Sigs( floatx80 a, floatx80 b, flag zSign STATUS_PARAM } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the absolute values of the extended -| double-precision floating-point values `a' and `b'. If `zSign' is 1, the -| difference is negated before being returned. `zSign' is ignored if the -| result is a NaN. The subtraction is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the absolute values of the extended +double-precision floating-point values `a' and `b'. If `zSign' is 1, the +difference is negated before being returned. `zSign' is ignored if the +result is a NaN. The subtraction is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ static floatx80 subFloatx80Sigs( floatx80 a, floatx80 b, flag zSign STATUS_PARAM ) { int32 aExp, bExp, zExp; @@ -4670,12 +4826,13 @@ static floatx80 subFloatx80Sigs( floatx80 a, floatx80 b, flag zSign STATUS_PARAM } -/*---------------------------------------------------------------------------- -| Returns the result of adding the extended double-precision floating-point -| values `a' and `b'. The operation is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the extended double-precision floating-point +values `a' and `b'. The operation is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 floatx80_add( floatx80 a, floatx80 b STATUS_PARAM ) { flag aSign, bSign; @@ -4691,12 +4848,13 @@ floatx80 floatx80_add( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the extended double-precision floating- -| point values `a' and `b'. The operation is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the extended double-precision floating- +point values `a' and `b'. The operation is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ floatx80 floatx80_sub( floatx80 a, floatx80 b STATUS_PARAM ) { flag aSign, bSign; @@ -4712,12 +4870,13 @@ floatx80 floatx80_sub( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of multiplying the extended double-precision floating- -| point values `a' and `b'. The operation is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of multiplying the extended double-precision floating- +point values `a' and `b'. The operation is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 floatx80_mul( floatx80 a, floatx80 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -4771,12 +4930,13 @@ floatx80 floatx80_mul( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of dividing the extended double-precision floating-point -| value `a' by the corresponding value `b'. The operation is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of dividing the extended double-precision floating-point +value `a' by the corresponding value `b'. The operation is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ floatx80 floatx80_div( floatx80 a, floatx80 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -4851,12 +5011,13 @@ floatx80 floatx80_div( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the remainder of the extended double-precision floating-point value -| `a' with respect to the corresponding value `b'. The operation is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the remainder of the extended double-precision floating-point value +`a' with respect to the corresponding value `b'. The operation is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 floatx80_rem( floatx80 a, floatx80 b STATUS_PARAM ) { flag aSign, zSign; @@ -4947,12 +5108,13 @@ floatx80 floatx80_rem( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the square root of the extended double-precision floating-point -| value `a'. The operation is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the square root of the extended double-precision floating-point +value `a'. The operation is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ floatx80 floatx80_sqrt( floatx80 a STATUS_PARAM ) { flag aSign; @@ -5017,12 +5179,14 @@ floatx80 floatx80_sqrt( floatx80 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is equal -| to the corresponding value `b', and 0 otherwise. The invalid exception is -| raised if either operand is a NaN. Otherwise, the comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is equal +to the corresponding value `b', and 0 otherwise. The invalid exception is +raised if either operand is a NaN. Otherwise, the comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_eq( floatx80 a, floatx80 b STATUS_PARAM ) { @@ -5044,13 +5208,15 @@ int floatx80_eq( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is -| less than or equal to the corresponding value `b', and 0 otherwise. The -| invalid exception is raised if either operand is a NaN. The comparison is -| performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is +less than or equal to the corresponding value `b', and 0 otherwise. The +invalid exception is raised if either operand is a NaN. The comparison is +performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_le( floatx80 a, floatx80 b STATUS_PARAM ) { @@ -5078,12 +5244,14 @@ int floatx80_le( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is -| less than the corresponding value `b', and 0 otherwise. The invalid -| exception is raised if either operand is a NaN. The comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is +less than the corresponding value `b', and 0 otherwise. The invalid +exception is raised if either operand is a NaN. The comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_lt( floatx80 a, floatx80 b STATUS_PARAM ) { @@ -5111,12 +5279,14 @@ int floatx80_lt( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point values `a' and `b' -| cannot be compared, and 0 otherwise. 
The invalid exception is raised if -| either operand is a NaN. The comparison is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point values `a' and `b' +cannot be compared, and 0 otherwise. The invalid exception is raised if +either operand is a NaN. The comparison is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_unordered( floatx80 a, floatx80 b STATUS_PARAM ) { if ( ( ( extractFloatx80Exp( a ) == 0x7FFF ) @@ -5130,12 +5300,14 @@ int floatx80_unordered( floatx80 a, floatx80 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is -| equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not -| cause an exception. The comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is +equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not +cause an exception. The comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ int floatx80_eq_quiet( floatx80 a, floatx80 b STATUS_PARAM ) { @@ -5160,12 +5332,14 @@ int floatx80_eq_quiet( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is less -| than or equal to the corresponding value `b', and 0 otherwise. Quiet NaNs -| do not cause an exception. Otherwise, the comparison is performed according -| to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is less +than or equal to the corresponding value `b', and 0 otherwise. Quiet NaNs +do not cause an exception. Otherwise, the comparison is performed according +to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_le_quiet( floatx80 a, floatx80 b STATUS_PARAM ) { @@ -5196,12 +5370,14 @@ int floatx80_le_quiet( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point value `a' is less -| than the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause -| an exception. Otherwise, the comparison is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point value `a' is less +than the corresponding value `b', and 0 otherwise. 
Quiet NaNs do not cause +an exception. Otherwise, the comparison is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_lt_quiet( floatx80 a, floatx80 b STATUS_PARAM ) { @@ -5232,12 +5408,14 @@ int floatx80_lt_quiet( floatx80 a, floatx80 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the extended double-precision floating-point values `a' and `b' -| cannot be compared, and 0 otherwise. Quiet NaNs do not cause an exception. -| The comparison is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the extended double-precision floating-point values `a' and `b' +cannot be compared, and 0 otherwise. Quiet NaNs do not cause an exception. +The comparison is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int floatx80_unordered_quiet( floatx80 a, floatx80 b STATUS_PARAM ) { if ( ( ( extractFloatx80Exp( a ) == 0x7FFF ) @@ -5254,16 +5432,17 @@ int floatx80_unordered_quiet( floatx80 a, floatx80 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the 32-bit two's complement integer format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic---which means in particular that the conversion is rounded -| according to the current rounding mode. If `a' is a NaN, the largest -| positive integer is returned. 
Otherwise, if the conversion overflows, the -| largest integer with the same sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the 32-bit two's complement integer format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic---which means in particular that the conversion is rounded +according to the current rounding mode. If `a' is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as `a' is returned. +------------------------------------------------------------------------------- +*/ int32 float128_to_int32( float128 a STATUS_PARAM ) { flag aSign; @@ -5283,16 +5462,17 @@ int32 float128_to_int32( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the 32-bit two's complement integer format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. If -| `a' is a NaN, the largest positive integer is returned. Otherwise, if the -| conversion overflows, the largest integer with the same sign as `a' is -| returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the 32-bit two's complement integer format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. 
If +`a' is a NaN, the largest positive integer is returned. Otherwise, if the +conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int32 float128_to_int32_round_to_zero( float128 a STATUS_PARAM ) { flag aSign; @@ -5331,16 +5511,17 @@ int32 float128_to_int32_round_to_zero( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the 64-bit two's complement integer format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic---which means in particular that the conversion is rounded -| according to the current rounding mode. If `a' is a NaN, the largest -| positive integer is returned. Otherwise, if the conversion overflows, the -| largest integer with the same sign as `a' is returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the 64-bit two's complement integer format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic---which means in particular that the conversion is rounded +according to the current rounding mode. If `a' is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as `a' is returned. 
+------------------------------------------------------------------------------- +*/ int64 float128_to_int64( float128 a STATUS_PARAM ) { flag aSign; @@ -5374,16 +5555,17 @@ int64 float128_to_int64( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the 64-bit two's complement integer format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic, except that the conversion is always rounded toward zero. -| If `a' is a NaN, the largest positive integer is returned. Otherwise, if -| the conversion overflows, the largest integer with the same sign as `a' is -| returned. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the 64-bit two's complement integer format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic, except that the conversion is always rounded toward zero. +If `a' is a NaN, the largest positive integer is returned. Otherwise, if +the conversion overflows, the largest integer with the same sign as `a' is +returned. +------------------------------------------------------------------------------- +*/ int64 float128_to_int64_round_to_zero( float128 a STATUS_PARAM ) { flag aSign; @@ -5435,13 +5617,14 @@ int64 float128_to_int64_round_to_zero( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the single-precision floating-point format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the single-precision floating-point format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ float32 float128_to_float32( float128 a STATUS_PARAM ) { flag aSign; @@ -5470,13 +5653,14 @@ float32 float128_to_float32( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the double-precision floating-point format. The conversion -| is performed according to the IEC/IEEE Standard for Binary Floating-Point -| Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the double-precision floating-point format. The conversion +is performed according to the IEC/IEEE Standard for Binary Floating-Point +Arithmetic. +------------------------------------------------------------------------------- +*/ float64 float128_to_float64( float128 a STATUS_PARAM ) { flag aSign; @@ -5503,13 +5687,14 @@ float64 float128_to_float64( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of converting the quadruple-precision floating-point -| value `a' to the extended double-precision floating-point format. The -| conversion is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of converting the quadruple-precision floating-point +value `a' to the extended double-precision floating-point format. The +conversion is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ floatx80 float128_to_floatx80( float128 a STATUS_PARAM ) { flag aSign; @@ -5538,13 +5723,14 @@ floatx80 float128_to_floatx80( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Rounds the quadruple-precision floating-point value `a' to an integer, and -| returns the result as a quadruple-precision floating-point value. The -| operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Rounds the quadruple-precision floating-point value `a' to an integer, and +returns the result as a quadruple-precision floating-point value. The +operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 float128_round_to_int( float128 a STATUS_PARAM ) { flag aSign; @@ -5641,14 +5827,15 @@ float128 float128_round_to_int( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of adding the absolute values of the quadruple-precision -| floating-point values `a' and `b'. If `zSign' is 1, the sum is negated -| before being returned. `zSign' is ignored if the result is a NaN. 
-| The addition is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the absolute values of the quadruple-precision +floating-point values `a' and `b'. If `zSign' is 1, the sum is negated +before being returned. `zSign' is ignored if the result is a NaN. +The addition is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ static float128 addFloat128Sigs( float128 a, float128 b, flag zSign STATUS_PARAM) { int32 aExp, bExp, zExp; @@ -5727,14 +5914,15 @@ static float128 addFloat128Sigs( float128 a, float128 b, flag zSign STATUS_PARAM } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the absolute values of the quadruple- -| precision floating-point values `a' and `b'. If `zSign' is 1, the -| difference is negated before being returned. `zSign' is ignored if the -| result is a NaN. The subtraction is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the absolute values of the quadruple- +precision floating-point values `a' and `b'. If `zSign' is 1, the +difference is negated before being returned. `zSign' is ignored if the +result is a NaN. The subtraction is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ static float128 subFloat128Sigs( float128 a, float128 b, flag zSign STATUS_PARAM) { int32 aExp, bExp, zExp; @@ -5811,12 +5999,13 @@ static float128 subFloat128Sigs( float128 a, float128 b, flag zSign STATUS_PARAM } -/*---------------------------------------------------------------------------- -| Returns the result of adding the quadruple-precision floating-point values -| `a' and `b'. The operation is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of adding the quadruple-precision floating-point values +`a' and `b'. The operation is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 float128_add( float128 a, float128 b STATUS_PARAM ) { flag aSign, bSign; @@ -5832,12 +6021,13 @@ float128 float128_add( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of subtracting the quadruple-precision floating-point -| values `a' and `b'. The operation is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of subtracting the quadruple-precision floating-point +values `a' and `b'. The operation is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float128 float128_sub( float128 a, float128 b STATUS_PARAM ) { flag aSign, bSign; @@ -5853,12 +6043,13 @@ float128 float128_sub( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of multiplying the quadruple-precision floating-point -| values `a' and `b'. The operation is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of multiplying the quadruple-precision floating-point +values `a' and `b'. The operation is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 float128_mul( float128 a, float128 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -5917,12 +6108,13 @@ float128 float128_mul( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the result of dividing the quadruple-precision floating-point value -| `a' by the corresponding value `b'. The operation is performed according to -| the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the result of dividing the quadruple-precision floating-point value +`a' by the corresponding value `b'. The operation is performed according to +the IEC/IEEE Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float128 float128_div( float128 a, float128 b STATUS_PARAM ) { flag aSign, bSign, zSign; @@ -6001,12 +6193,13 @@ float128 float128_div( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the remainder of the quadruple-precision floating-point value `a' -| with respect to the corresponding value `b'. The operation is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the remainder of the quadruple-precision floating-point value `a' +with respect to the corresponding value `b'. The operation is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ float128 float128_rem( float128 a, float128 b STATUS_PARAM ) { flag aSign, zSign; @@ -6110,12 +6303,13 @@ float128 float128_rem( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns the square root of the quadruple-precision floating-point value `a'. -| The operation is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ - +/* +------------------------------------------------------------------------------- +Returns the square root of the quadruple-precision floating-point value `a'. +The operation is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ float128 float128_sqrt( float128 a STATUS_PARAM ) { flag aSign; @@ -6179,12 +6373,14 @@ float128 float128_sqrt( float128 a STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is equal to -| the corresponding value `b', and 0 otherwise. The invalid exception is -| raised if either operand is a NaN. Otherwise, the comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is equal to +the corresponding value `b', and 0 otherwise. The invalid exception is +raised if either operand is a NaN. Otherwise, the comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_eq( float128 a, float128 b STATUS_PARAM ) { @@ -6206,12 +6402,14 @@ int float128_eq( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is less than -| or equal to the corresponding value `b', and 0 otherwise. The invalid -| exception is raised if either operand is a NaN. The comparison is performed -| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is less than +or equal to the corresponding value `b', and 0 otherwise. 
The invalid +exception is raised if either operand is a NaN. The comparison is performed +according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_le( float128 a, float128 b STATUS_PARAM ) { @@ -6239,12 +6437,14 @@ int float128_le( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is less than -| the corresponding value `b', and 0 otherwise. The invalid exception is -| raised if either operand is a NaN. The comparison is performed according -| to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is less than +the corresponding value `b', and 0 otherwise. The invalid exception is +raised if either operand is a NaN. The comparison is performed according +to the IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_lt( float128 a, float128 b STATUS_PARAM ) { @@ -6272,12 +6472,14 @@ int float128_lt( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point values `a' and `b' cannot -| be compared, and 0 otherwise. The invalid exception is raised if either -| operand is a NaN. The comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point values `a' and `b' cannot +be compared, and 0 otherwise. The invalid exception is raised if either +operand is a NaN. The comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_unordered( float128 a, float128 b STATUS_PARAM ) { @@ -6292,12 +6494,14 @@ int float128_unordered( float128 a, float128 b STATUS_PARAM ) return 0; } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is equal to -| the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an -| exception. The comparison is performed according to the IEC/IEEE Standard -| for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is equal to +the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an +exception. The comparison is performed according to the IEC/IEEE Standard +for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_eq_quiet( float128 a, float128 b STATUS_PARAM ) { @@ -6322,12 +6526,14 @@ int float128_eq_quiet( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is less than -| or equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not -| cause an exception. 
Otherwise, the comparison is performed according to the -| IEC/IEEE Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is less than +or equal to the corresponding value `b', and 0 otherwise. Quiet NaNs do not +cause an exception. Otherwise, the comparison is performed according to the +IEC/IEEE Standard for Binary Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_le_quiet( float128 a, float128 b STATUS_PARAM ) { @@ -6358,12 +6564,14 @@ int float128_le_quiet( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point value `a' is less than -| the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an -| exception. Otherwise, the comparison is performed according to the IEC/IEEE -| Standard for Binary Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point value `a' is less than +the corresponding value `b', and 0 otherwise. Quiet NaNs do not cause an +exception. Otherwise, the comparison is performed according to the IEC/IEEE +Standard for Binary Floating-Point Arithmetic. 
+------------------------------------------------------------------------------- +*/ int float128_lt_quiet( float128 a, float128 b STATUS_PARAM ) { @@ -6394,12 +6602,14 @@ int float128_lt_quiet( float128 a, float128 b STATUS_PARAM ) } -/*---------------------------------------------------------------------------- -| Returns 1 if the quadruple-precision floating-point values `a' and `b' cannot -| be compared, and 0 otherwise. Quiet NaNs do not cause an exception. The -| comparison is performed according to the IEC/IEEE Standard for Binary -| Floating-Point Arithmetic. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Returns 1 if the quadruple-precision floating-point values `a' and `b' cannot +be compared, and 0 otherwise. Quiet NaNs do not cause an exception. The +comparison is performed according to the IEC/IEEE Standard for Binary +Floating-Point Arithmetic. +------------------------------------------------------------------------------- +*/ int float128_unordered_quiet( float128 a, float128 b STATUS_PARAM ) { diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat.h index f3927e2..b646621 100644 --- a/include/fpu/softfloat.h +++ b/include/fpu/softfloat.h @@ -4,10 +4,11 @@ * Derived from SoftFloat. */ -/*============================================================================ +/* +============================================================================ -This C header file is part of the SoftFloat IEC/IEEE Floating-point Arithmetic -Package, Release 2b. +This C header file is part of the SoftFloat IEC/IEEE Floating-point +Arithmetic Package, Release 2a. Written by John R. Hauser. This work was made possible in part by the International Computer Science Institute, located at Suite 600, 1947 Center @@ -16,24 +17,22 @@ National Science Foundation under grant MIP-9311980. 
The original version of this code was written as part of a project to build a fixed-point vector processor in collaboration with the University of California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek. More information -is available through the Web page `http://www.cs.berkeley.edu/~jhauser/ +is available through the Web page `http://HTTP.CS.Berkeley.EDU/~jhauser/ arithmetic/SoftFloat.html'. -THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort has -been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT TIMES -RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO PERSONS -AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ALL LOSSES, -COSTS, OR OTHER PROBLEMS THEY INCUR DUE TO THE SOFTWARE, AND WHO FURTHERMORE -EFFECTIVELY INDEMNIFY JOHN HAUSER AND THE INTERNATIONAL COMPUTER SCIENCE -INSTITUTE (possibly via similar legal warning) AGAINST ALL LOSSES, COSTS, OR -OTHER PROBLEMS INCURRED BY THEIR CUSTOMERS AND CLIENTS DUE TO THE SOFTWARE. +THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort +has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT +TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO +PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY +AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE. Derivative works are acceptable, even for commercial purposes, so long as -(1) the source code for the derivative work includes prominent notice that -the work is derivative, and (2) the source code includes prominent notice with -these four paragraphs for those parts of this code that are retained. +(1) they include prominent notice that the work is derivative, and (2) they +include prominent notice akin to these four paragraphs for those parts of +this code that are retained. 
-=============================================================================*/ +=============================================================================== +*/ #ifndef SOFTFLOAT_H #define SOFTFLOAT_H @@ -46,14 +45,16 @@ these four paragraphs for those parts of this code that are retained. #include "config-host.h" #include "qemu/osdep.h" -/*---------------------------------------------------------------------------- -| Each of the following `typedef's defines the most convenient type that holds -| integers of at least as many bits as specified. For example, `uint8' should -| be the most convenient type that can hold unsigned integers of as many as -| 8 bits. The `flag' type must be able to hold either a 0 or 1. For most -| implementations of C, `flag', `uint8', and `int8' should all be `typedef'ed -| to the same as `int'. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Each of the following `typedef's defines the most convenient type that holds +integers of at least as many bits as specified. For example, `uint8' should +be the most convenient type that can hold unsigned integers of as many as +8 bits. The `flag' type must be able to hold either a 0 or 1. For most +implementations of C, `flag', `uint8', and `int8' should all be `typedef'ed +to the same as `int'. 
+------------------------------------------------------------------------------- +*/ typedef uint8_t flag; typedef uint8_t uint8; typedef int8_t int8; @@ -69,9 +70,11 @@ typedef int64_t int64; #define STATUS(field) status->field #define STATUS_VAR , status -/*---------------------------------------------------------------------------- -| Software IEC/IEEE floating-point ordering relations -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software IEC/IEEE floating-point ordering relations +------------------------------------------------------------------------------- +*/ enum { float_relation_less = -1, float_relation_equal = 0, @@ -79,9 +82,11 @@ enum { float_relation_unordered = 2 }; -/*---------------------------------------------------------------------------- -| Software IEC/IEEE floating-point types. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software IEC/IEEE floating-point types. +------------------------------------------------------------------------------- +*/ /* Use structures for soft-float types. This prevents accidentally mixing them with native int/float types. A sufficiently clever compiler and sane ABI should be able to see though these structs. However @@ -137,17 +142,21 @@ typedef struct { #define make_float128(high_, low_) ((float128) { .high = high_, .low = low_ }) #define make_float128_init(high_, low_) { .high = high_, .low = low_ } -/*---------------------------------------------------------------------------- -| Software IEC/IEEE floating-point underflow tininess-detection mode. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software IEC/IEEE floating-point underflow tininess-detection mode. +------------------------------------------------------------------------------- +*/ enum { float_tininess_after_rounding = 0, float_tininess_before_rounding = 1 }; -/*---------------------------------------------------------------------------- -| Software IEC/IEEE floating-point rounding mode. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software IEC/IEEE floating-point rounding mode. +------------------------------------------------------------------------------- +*/ enum { float_round_nearest_even = 0, float_round_down = 1, @@ -155,9 +164,11 @@ enum { float_round_to_zero = 3 }; -/*---------------------------------------------------------------------------- -| Software IEC/IEEE floating-point exception flags. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software IEC/IEEE floating-point exception flags. +------------------------------------------------------------------------------- +*/ enum { float_flag_invalid = 1, float_flag_divbyzero = 4, @@ -167,7 +178,6 @@ enum { float_flag_input_denormal = 64, float_flag_output_denormal = 128 }; - typedef struct float_status { signed char float_detect_tininess; signed char float_rounding_mode; @@ -204,27 +214,33 @@ INLINE int get_float_exception_flags(float_status *status) } void set_floatx80_rounding_precision(int val STATUS_PARAM); -/*---------------------------------------------------------------------------- -| Routine to raise any or all of the software IEC/IEEE floating-point -| exception flags. 
-*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Routine to raise any or all of the software IEC/IEEE floating-point +exception flags. +------------------------------------------------------------------------------- +*/ void float_raise( int8 flags STATUS_PARAM); -/*---------------------------------------------------------------------------- -| Options to indicate which negations to perform in float*_muladd() -| Using these differs from negating an input or output before calling -| the muladd function in that this means that a NaN doesn't have its -| sign bit inverted before it is propagated. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Options to indicate which negations to perform in float*_muladd() +Using these differs from negating an input or output before calling +the muladd function in that this means that a NaN doesn't have its +sign bit inverted before it is propagated. +------------------------------------------------------------------------------- +*/ enum { float_muladd_negate_c = 1, float_muladd_negate_product = 2, float_muladd_negate_result = 4, }; -/*---------------------------------------------------------------------------- -| Software IEC/IEEE integer-to-floating-point conversion routines. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software IEC/IEEE integer-to-floating-point conversion routines. 
+------------------------------------------------------------------------------- +*/ float32 int32_to_float32( int32 STATUS_PARAM ); float64 int32_to_float64( int32 STATUS_PARAM ); float32 uint32_to_float32( uint32 STATUS_PARAM ); @@ -239,15 +255,19 @@ floatx80 int64_to_floatx80( int64 STATUS_PARAM ); float128 int64_to_float128( int64 STATUS_PARAM ); float128 uint64_to_float128( uint64 STATUS_PARAM ); -/*---------------------------------------------------------------------------- -| Software half-precision conversion routines. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software half-precision conversion routines. +------------------------------------------------------------------------------- +*/ float16 float32_to_float16( float32, flag STATUS_PARAM ); float32 float16_to_float32( float16, flag STATUS_PARAM ); -/*---------------------------------------------------------------------------- -| Software half-precision operations. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +Software half-precision operations. +------------------------------------------------------------------------------- +*/ int float16_is_quiet_nan( float16 ); int float16_is_signaling_nan( float16 ); float16 float16_maybe_silence_nan( float16 ); @@ -257,14 +277,18 @@ INLINE int float16_is_any_nan(float16 a) return ((float16_val(a) & ~0x8000) > 0x7c00); } -/*---------------------------------------------------------------------------- -| The pattern for a default generated half-precision NaN. -*----------------------------------------------------------------------------*/ +/* +------------------------------------------------------------------------------- +The pattern for a default generated half-precision NaN.
+-------------------------------------------------------------------------------
+*/
 extern const float16 float16_default_nan;

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE single-precision conversion routines.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE single-precision conversion routines.
+-------------------------------------------------------------------------------
+*/
 int_fast16_t float32_to_int16_round_to_zero(float32 STATUS_PARAM);
 uint_fast16_t float32_to_uint16_round_to_zero(float32 STATUS_PARAM);
 int32 float32_to_int32( float32 STATUS_PARAM );
@@ -277,9 +301,11 @@ float64 float32_to_float64( float32 STATUS_PARAM );
 floatx80 float32_to_floatx80( float32 STATUS_PARAM );
 float128 float32_to_float128( float32 STATUS_PARAM );

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE single-precision operations.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE single-precision operations.
+-------------------------------------------------------------------------------
+*/
 float32 float32_round_to_int( float32 STATUS_PARAM );
 float32 float32_add( float32, float32 STATUS_PARAM );
 float32 float32_sub( float32, float32 STATUS_PARAM );
@@ -361,14 +387,18 @@ INLINE float32 float32_set_sign(float32 a, int sign)
 #define float32_infinity make_float32(0x7f800000)

-/*----------------------------------------------------------------------------
-| The pattern for a default generated single-precision NaN.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+The pattern for a default generated single-precision NaN.
+-------------------------------------------------------------------------------
+*/
 extern const float32 float32_default_nan;

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE double-precision conversion routines.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE double-precision conversion routines.
+-------------------------------------------------------------------------------
+*/
 int_fast16_t float64_to_int16_round_to_zero(float64 STATUS_PARAM);
 uint_fast16_t float64_to_uint16_round_to_zero(float64 STATUS_PARAM);
 int32 float64_to_int32( float64 STATUS_PARAM );
@@ -383,9 +413,11 @@ float32 float64_to_float32( float64 STATUS_PARAM );
 floatx80 float64_to_floatx80( float64 STATUS_PARAM );
 float128 float64_to_float128( float64 STATUS_PARAM );

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE double-precision operations.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE double-precision operations.
+-------------------------------------------------------------------------------
+*/
 float64 float64_round_to_int( float64 STATUS_PARAM );
 float64 float64_trunc_to_int( float64 STATUS_PARAM );
 float64 float64_add( float64, float64 STATUS_PARAM );
@@ -467,14 +499,18 @@ INLINE float64 float64_set_sign(float64 a, int sign)
 #define float64_half make_float64(0x3fe0000000000000LL)
 #define float64_infinity make_float64(0x7ff0000000000000LL)

-/*----------------------------------------------------------------------------
-| The pattern for a default generated double-precision NaN.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+The pattern for a default generated double-precision NaN.
+-------------------------------------------------------------------------------
+*/
 extern const float64 float64_default_nan;

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE extended double-precision conversion routines.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE extended double-precision conversion routines.
+-------------------------------------------------------------------------------
+*/
 int32 floatx80_to_int32( floatx80 STATUS_PARAM );
 int32 floatx80_to_int32_round_to_zero( floatx80 STATUS_PARAM );
 int64 floatx80_to_int64( floatx80 STATUS_PARAM );
@@ -483,9 +519,11 @@ float32 floatx80_to_float32( floatx80 STATUS_PARAM );
 float64 floatx80_to_float64( floatx80 STATUS_PARAM );
 float128 floatx80_to_float128( floatx80 STATUS_PARAM );

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE extended double-precision operations.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE extended double-precision operations.
+-------------------------------------------------------------------------------
+*/
 floatx80 floatx80_round_to_int( floatx80 STATUS_PARAM );
 floatx80 floatx80_add( floatx80, floatx80 STATUS_PARAM );
 floatx80 floatx80_sub( floatx80, floatx80 STATUS_PARAM );
@@ -552,14 +590,18 @@ INLINE int floatx80_is_any_nan(floatx80 a)
 #define floatx80_half make_floatx80(0x3ffe, 0x8000000000000000LL)
 #define floatx80_infinity make_floatx80(0x7fff, 0x8000000000000000LL)

-/*----------------------------------------------------------------------------
-| The pattern for a default generated extended double-precision NaN.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+The pattern for a default generated extended double-precision NaN.
+-------------------------------------------------------------------------------
+*/
 extern const floatx80 floatx80_default_nan;

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE quadruple-precision conversion routines.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE quadruple-precision conversion routines.
+-------------------------------------------------------------------------------
+*/
 int32 float128_to_int32( float128 STATUS_PARAM );
 int32 float128_to_int32_round_to_zero( float128 STATUS_PARAM );
 int64 float128_to_int64( float128 STATUS_PARAM );
@@ -568,9 +610,11 @@ float32 float128_to_float32( float128 STATUS_PARAM );
 float64 float128_to_float64( float128 STATUS_PARAM );
 floatx80 float128_to_floatx80( float128 STATUS_PARAM );

-/*----------------------------------------------------------------------------
-| Software IEC/IEEE quadruple-precision operations.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+Software IEC/IEEE quadruple-precision operations.
+-------------------------------------------------------------------------------
+*/
 float128 float128_round_to_int( float128 STATUS_PARAM );
 float128 float128_add( float128, float128 STATUS_PARAM );
 float128 float128_sub( float128, float128 STATUS_PARAM );
@@ -633,9 +677,11 @@ INLINE int float128_is_any_nan(float128 a)
 #define float128_zero make_float128(0, 0)

-/*----------------------------------------------------------------------------
-| The pattern for a default generated quadruple-precision NaN.
-*----------------------------------------------------------------------------*/
+/*
+-------------------------------------------------------------------------------
+The pattern for a default generated quadruple-precision NaN.
+-------------------------------------------------------------------------------
+*/
 extern const float128 float128_default_nan;

 #endif /* !SOFTFLOAT_H */