From patchwork Wed Oct 16 01:46:39 2013
X-Patchwork-Submitter: Hans-Peter Nilsson
X-Patchwork-Id: 283812
Date: Wed, 16 Oct 2013 03:46:39 +0200
Message-Id: <201310160146.r9G1kdrH010785@ignucius.se.axis.com>
From: Hans-Peter Nilsson
To: gcc-patches@gcc.gnu.org
Subject: Committed: CRIS: new multilib for v8, libgcc improvements and move to soft-fp.

There's a page-full or two of numbers reported here with the
background, but maintainers of software-floating-point ports used with
a microcontroller may find them of use, if they're on a cycle or size
budget and are considering fp-bit vs. soft-fp.

For an on-chip controller subsystem with a CRIS CPU, there was a
control loop with floating-point numbers involved, with numbers having
a range that did not appear to lend itself to easy conversion to our
fixed-point library (shameless plug here seems in order; it's LGPL: )
i.e. there was no suitable "fixed point" position within 32 bits.
Worse, this particular CRIS CPU does not have fast multiplication; it
has the "CRIS v8" ISA.

The golden model for the control loop used "double", but a raw cycle
count using that type showed 117836 cycles (average over a data set of
1000 iterations, using the CRIS simulator co-habiting with the gdb
project).  The budget was 25000.  Thankfully, 24 bits of precision
proved sufficient.  Switching to "float" made the number of cycles
shrink to 46974.  Still a long way to go.

Switching to a separate multilib made sense, as this CRIS version has
a leading-zeros-count insn which did not exist in the base version,
and libgcc makes use of that, both for the older fp-bit.c and for the
newer soft-fp floating-point libraries.  CRIS used the older fp-bit
library only because I'd never found reason to try the newer soft-fp
before, but I recalled promises of big performance improvements.  The
multilib did help some, but not much; just taking the number down to
45741 cycles (some 2%).  A special umulsidi3 function and an improved
longlong.h helped more, down to 43266 cycles; about 5%.

Then I tried soft-fp and...  Bam!  The number of cycles went down to
23318, a 46% improvement.  Arguably, there was a downside: the size of
the program went up, from 10901 to 13741 bytes.  More tweaks included
in the patch (arit.c, mulsi3.S) showed small improvements, down to
23236 cycles.
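
To illustrate what a special umulsidi3 has to compute on a CPU without
a wide multiply insn: a 32x32->64 unsigned product can be assembled
from four 16x16 partial products, ab*cd = (a*c)<<32 + (a*d + b*c)<<16
+ b*d, with the carries propagated into the high word.  The following
is only a portable C model of that arithmetic under those assumptions,
not the CRIS assembly from the patch; the names model_umulsidi3 and
check_model are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model (not the libgcc code): 32x32->64 unsigned
   multiplication from 16x16->32 partial products.  */
uint64_t
model_umulsidi3 (uint32_t ab, uint32_t cd)
{
  uint32_t a = ab >> 16, b = ab & 0xffff;
  uint32_t c = cd >> 16, d = cd & 0xffff;

  /* Each partial product fits in 32 bits: 0xffff * 0xffff = 0xfffe0001.  */
  uint32_t bd = b * d;
  uint32_t ad = a * d;
  uint32_t bc = b * c;
  uint32_t ac = a * c;

  /* Middle term; the sum of two 32-bit products can carry out.  */
  uint32_t mid = ad + bc;
  uint32_t mid_carry = mid < ad;

  /* Low word: low half of the middle term shifted up, plus b*d.  */
  uint32_t lo = (mid << 16) + bd;
  uint32_t lo_carry = lo < bd;

  /* High word: a*c, the high half of the middle term, the middle-term
     carry (worth bit 16 of the high word) and the low-word carry.  */
  uint32_t hi = ac + (mid >> 16) + (mid_carry << 16) + lo_carry;

  return ((uint64_t) hi << 32) | lo;
}

/* Compare the model against the compiler's own widening multiply.  */
void
check_model (uint32_t u, uint32_t v)
{
  assert (model_umulsidi3 (u, v) == (uint64_t) u * v);
}
```

Calling check_model with a few operand pairs, including ones where the
low 16 bits are zero (common for floating-point fractions), is a quick
way to convince oneself that the carry handling is right.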

If you're in this situation and using newlib, you'll also find that
using __ieee754_sqrtf instead of sqrtf helps, as does -ffast-math if
your formulas allow that.  Though, those changes just took the number
of cycles down to 22951 (1%).  Use of the internal sqrt-function makes
more sense from a size perspective, as the wrapper function pulls in
conversion support from and to double (obvious from the code).  The
difference is 13713 to 11035 bytes (static data and program), which
matters more when you know that the size of the associated internal
RAM is 16 KiB and that those previous numbers exclude startup code,
stack and heap.

Checking afterwards whether the separate multilib still made sense, it
seems that without it, the number of cycles would be 25158 (with all
other improvements in), so yes.  Apparently soft-fp makes more use of
leading-zeros-count than fp-bit.

A belated bottom-line advice: port maintainers who have not already
done so, switch to soft-fp from fp-bit if floating-point performance
may ever matter to your port.  If you're worried it'll bloat the code,
that's your decision, but you can actually get something like twice
the speed *at the application level* for a 1/4 size increase.  If
you're worried the library doesn't have the knobs you want for your
port, it does have them; a lot more than fp-bit.  Have a look: there
are several different core code choices for each of multiplication,
division and addition.  And as seen in the patch, I didn't even have
to turn those knobs.  (Also, the goal of the target is met and I'm out
of time.  Later.)

Tested to not regress for cris-elf (all multilibs except v32) and
crisv32-elf for r203561.  Also built crisv32-linux-gnu for
sanity-checking, but that was for a local import done long ago,
4.7-era.

gcc:
	* config/cris/t-elfmulti (MULTILIB_OPTIONS, MULTILIB_DIRNAMES)
	(MULTILIB_MATCHES): Add multilib for -march=v8.

libgcc:
	For CRIS ports, switch to soft-fp.  Improve arit.c and longlong.h.
	* config.host (cpu_type) : Add entry for crisv32-*-*.
	(tmake_file) : Adjust.
	* longlong.h: Wrap the whole CRIS section in a single
	defined(__CRIS__) conditional.  Add comment about add_ssaaaa and
	sub_ddmmss.
	(COUNT_LEADING_ZEROS_0): Define when count_leading_zeros is
	defined.
	[__CRIS__] (__umulsidi3): Define.
	[__CRIS__] (umul_ppmm): Define in terms of __umulsidi3.
	* config/cris/sfp-machine.h: New file.
	* config/cris/umulsidi3.S: New file.
	* config/cris/t-elfmulti (LIB2ADD_ST): Add umulsidi3.S.
	* config/cris/arit.c (SIGNMULT): New macro.
	(__Div, __Mod): Use SIGNMULT instead of naked multiplication.
	* config/cris/mulsi3.S: Tweak to avoid redundant register-copying;
	saving 3 out of originally 33 cycles from the fastest path, 3 out
	of 54 from the medium path and one from the longest path.
	Improve comments.

brgds, H-P

diff --git a/gcc/config/cris/t-elfmulti b/gcc/config/cris/t-elfmulti
index 29ed57d..8bdbc55 100644
--- a/gcc/config/cris/t-elfmulti
+++ b/gcc/config/cris/t-elfmulti
@@ -16,9 +16,10 @@
 # along with GCC; see the file COPYING3.  If not see
 # .

-MULTILIB_OPTIONS = march=v10/march=v32
-MULTILIB_DIRNAMES = v10 v32
+MULTILIB_OPTIONS = march=v8/march=v10/march=v32
+MULTILIB_DIRNAMES = v8 v10 v32
 MULTILIB_MATCHES = \
+	march?v8=mcpu?v8 \
 	march?v10=mcpu?etrax100lx \
 	march?v10=mcpu?ng \
 	march?v10=march?etrax100lx \
diff --git a/libgcc/config.host b/libgcc/config.host
index c853a12..841167b 100644
--- a/libgcc/config.host
+++ b/libgcc/config.host
@@ -100,6 +100,9 @@ bfin*-*)
 	;;
 cr16-*-*)
 	;;
+crisv32-*-*)
+	cpu_type=cris
+	;;
 fido-*-*)
 	cpu_type=m68k
 	;;
@@ -422,13 +425,13 @@ cr16-*-elf)
 	extra_parts="$extra_parts crti.o crtn.o crtlibid.o"
 	;;
 crisv32-*-elf)
-	tmake_file="$tmake_file cris/t-cris t-fdpbit"
+	tmake_file="$tmake_file cris/t-cris t-softfp-sfdf t-softfp"
 	;;
 cris-*-elf)
-	tmake_file="$tmake_file cris/t-cris t-fdpbit cris/t-elfmulti"
+	tmake_file="$tmake_file cris/t-cris t-softfp-sfdf t-softfp cris/t-elfmulti"
 	;;
 cris-*-linux* | crisv32-*-linux*)
-	tmake_file="$tmake_file cris/t-cris t-fdpbit cris/t-linux"
+	tmake_file="$tmake_file cris/t-cris t-softfp-sfdf t-softfp cris/t-linux"
 	;;
 epiphany-*-elf*)
 	tmake_file="epiphany/t-epiphany t-fdpbit epiphany/t-custom-eqsf"
diff --git a/libgcc/config/cris/arit.c b/libgcc/config/cris/arit.c
index 32255f9..21bec66 100644
--- a/libgcc/config/cris/arit.c
+++ b/libgcc/config/cris/arit.c
@@ -39,6 +39,14 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #define LZ(v) __builtin_clz (v)
 #endif

+/* In (at least) the 4.7 series, GCC doesn't automatically choose the
+   most optimal strategy, possibly related to insufficient modelling of
+   delay-slot costs.  */
+#if defined (__CRIS_arch_version) && __CRIS_arch_version >= 10
+#define SIGNMULT(s, a) ((s) * (a)) /* Cheap multiplication, better than branch.  */
+#else
+#define SIGNMULT(s, a) ((s) < 0 ? -(a) : (a)) /* Branches are still better.  */
+#endif

 #if defined (L_udivsi3) || defined (L_divsi3) || defined (L_umodsi3) \
     || defined (L_modsi3)
@@ -199,6 +207,7 @@ __Div (long a, long b)
 {
   long extra = 0;
   long sign = (b < 0) ? -1 : 1;
+  long res;

   /* We need to handle a == -2147483648 as expected and must while
      doing that avoid producing a sequence like "abs (a) < 0" as GCC
@@ -214,15 +223,14 @@ __Div (long a, long b)
       if ((a & 0x7fffffff) == 0)
 	{
 	  /* We're at 0x80000000.  Tread carefully.  */
-	  a -= b * sign;
+	  a -= SIGNMULT (sign, b);
 	  extra = sign;
 	}
       a = -a;
     }

-  /* We knowingly penalize pre-v10 models by multiplication with the
-     sign.  */
-  return sign * do_31div (a, __builtin_labs (b)).quot + extra;
+  res = do_31div (a, __builtin_labs (b)).quot;
+  return SIGNMULT (sign, res) + extra;
 }
 #endif /* L_divsi3 */

@@ -274,6 +282,7 @@ long
 __Mod (long a, long b)
 {
   long sign = 1;
+  long res;

   /* We need to handle a == -2147483648 as expected and must while
      doing that avoid producing a sequence like "abs (a) < 0" as GCC
@@ -291,7 +300,8 @@ __Mod (long a, long b)
       a = -a;
     }

-  return sign * do_31div (a, __builtin_labs (b)).rem;
+  res = do_31div (a, __builtin_labs (b)).rem;
+  return SIGNMULT (sign, res);
 }
 #endif /* L_modsi3 */
 #endif /* L_udivsi3 || L_divsi3 || L_umodsi3 || L_modsi3 */
diff --git a/libgcc/config/cris/mulsi3.S b/libgcc/config/cris/mulsi3.S
index 76dfb63..213ed90 100644
--- a/libgcc/config/cris/mulsi3.S
+++ b/libgcc/config/cris/mulsi3.S
@@ -113,16 +113,22 @@ ___Mul:
 	ret
 	nop
 #else
-	move.d $r10,$r12
+;; See if we can avoid multiplying some of the parts, knowing
+;; they're zero.
+
 	move.d $r11,$r9
-	bound.d $r12,$r9
+	bound.d $r10,$r9
 	cmpu.w 65535,$r9
 	bls L(L3)
-	move.d $r12,$r13
+	move.d $r10,$r12

-	movu.w $r11,$r9
+;; Nope, have to do all the parts of a 32-bit multiplication.
+;; See head comment in optabs.c:expand_doubleword_mult.
+
+	move.d $r10,$r13
+	movu.w $r11,$r9 ; ab*cd = (a*d + b*c)<<16 + b*d
 	lslq 16,$r13
-	mstep $r9,$r13
+	mstep $r9,$r13 ; d*b
 	mstep $r9,$r13
 	mstep $r9,$r13
 	mstep $r9,$r13
@@ -140,7 +146,7 @@ ___Mul:
 	mstep $r9,$r13
 	clear.w $r10
 	test.d $r10
-	mstep $r9,$r10
+	mstep $r9,$r10 ; d*a
 	mstep $r9,$r10
 	mstep $r9,$r10
 	mstep $r9,$r10
@@ -157,10 +163,9 @@ ___Mul:
 	mstep $r9,$r10
 	mstep $r9,$r10
 	movu.w $r12,$r12
-	move.d $r11,$r9
-	clear.w $r9
-	test.d $r9
-	mstep $r12,$r9
+	clear.w $r11
+	move.d $r11,$r9 ; Doubles as a "test.d" preparing for the mstep.
+	mstep $r12,$r9 ; b*c
 	mstep $r12,$r9
 	mstep $r12,$r9
 	mstep $r12,$r9
@@ -182,17 +187,24 @@ ___Mul:
 	add.d $r13,$r10

 L(L3):
-	move.d $r9,$r10
+;; Form the maximum in $r10, by knowing the minimum, $r9.
+;; (We don't know which one of $r10 or $r11 it is.)
+;; Check if the largest operand is still just 16 bits.
+
+	xor $r9,$r10
 	xor $r11,$r10
-	xor $r12,$r10
 	cmpu.w 65535,$r10
 	bls L(L5)
 	movu.w $r9,$r13

-	movu.w $r13,$r13
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but c==0
+;; so we only need (a*d)<<16 + b*d with d = $r13, ab = $r10.
+;; We drop the upper part of (a*d)<<16 as we're only doing a
+;; 32-bit-result multiplication.
+
 	move.d $r10,$r9
 	lslq 16,$r9
-	mstep $r13,$r9
+	mstep $r13,$r9 ; b*d
 	mstep $r13,$r9
 	mstep $r13,$r9
 	mstep $r13,$r9
@@ -210,7 +222,7 @@ L(L3):
 	mstep $r13,$r9
 	clear.w $r10
 	test.d $r10
-	mstep $r13,$r10
+	mstep $r13,$r10 ; a*d
 	mstep $r13,$r10
 	mstep $r13,$r10
 	mstep $r13,$r10
@@ -231,25 +243,27 @@ L(L3):
 	add.d $r9,$r10

 L(L5):
-	movu.w $r9,$r9
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but a and c==0
+;; so b*d (with b=$r13, a=$r10) it is.
+
 	lslq 16,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
-	mstep $r9,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
 	ret
-	mstep $r9,$r10
+	mstep $r13,$r10
 #endif
 L(Lfe1):
 	.size ___Mul,L(Lfe1)-___Mul
diff --git a/libgcc/config/cris/sfp-machine.h b/libgcc/config/cris/sfp-machine.h
new file mode 100644
index 0000000..0d52a70
--- /dev/null
+++ b/libgcc/config/cris/sfp-machine.h
@@ -0,0 +1,78 @@
+/* Soft-FP definitions for CRIS.
+   Copyright (C) 2013 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+.  */
+
+#define _FP_W_TYPE_SIZE 32
+#define _FP_W_TYPE unsigned long
+#define _FP_WS_TYPE signed long
+#define _FP_I_TYPE long
+
+/* The type of the result of a floating point comparison.  This must
+   match `__libgcc_cmp_return__' in GCC for the target.  */
+typedef int __gcc_CMPtype __attribute__ ((mode (__libgcc_cmp_return__)));
+#define CMPtype __gcc_CMPtype
+
+/* FIXME: none of the *MEAT* macros have actually been benchmarked to be
+   better than any other choice for any CRIS variant.  */
+
+#define _FP_MUL_MEAT_S(R,X,Y) \
+  _FP_MUL_MEAT_1_wide(_FP_WFRACBITS_S,R,X,Y,umul_ppmm)
+#define _FP_MUL_MEAT_D(R,X,Y) \
+  _FP_MUL_MEAT_2_wide(_FP_WFRACBITS_D,R,X,Y,umul_ppmm)
+
+#define _FP_DIV_MEAT_S(R,X,Y) _FP_DIV_MEAT_1_loop(S,R,X,Y)
+#define _FP_DIV_MEAT_D(R,X,Y) _FP_DIV_MEAT_2_udiv(D,R,X,Y)
+
+#define _FP_NANFRAC_S ((_FP_QNANBIT_S << 1) - 1)
+#define _FP_NANFRAC_D ((_FP_QNANBIT_D << 1) - 1), -1
+#define _FP_NANSIGN_S 0
+#define _FP_NANSIGN_D 0
+#define _FP_QNANNEGATEDP 0
+#define _FP_KEEPNANFRACP 1
+
+/* Someone please check this.  */
+#define _FP_CHOOSENAN(fs, wc, R, X, Y, OP) \
+  do { \
+    if ((_FP_FRAC_HIGH_RAW_##fs(X) & _FP_QNANBIT_##fs) \
+	&& !(_FP_FRAC_HIGH_RAW_##fs(Y) & _FP_QNANBIT_##fs)) \
+      { \
+	R##_s = Y##_s; \
+	_FP_FRAC_COPY_##wc(R,Y); \
+      } \
+    else \
+      { \
+	R##_s = X##_s; \
+	_FP_FRAC_COPY_##wc(R,X); \
+      } \
+    R##_c = FP_CLS_NAN; \
+  } while (0)
+
+#define __LITTLE_ENDIAN 1234
+#define __BIG_ENDIAN 4321
+
+# define __BYTE_ORDER __LITTLE_ENDIAN
+
+/* Define ALIASNAME as a strong alias for NAME.  */
+# define strong_alias(name, aliasname) _strong_alias(name, aliasname)
+# define _strong_alias(name, aliasname) \
+  extern __typeof (name) aliasname __attribute__ ((alias (#name)));
diff --git a/libgcc/config/cris/t-elfmulti b/libgcc/config/cris/t-elfmulti
index b180521..308ef51 100644
--- a/libgcc/config/cris/t-elfmulti
+++ b/libgcc/config/cris/t-elfmulti
@@ -1,3 +1,3 @@
-LIB2ADD_ST = $(srcdir)/config/cris/mulsi3.S
+LIB2ADD_ST = $(srcdir)/config/cris/mulsi3.S $(srcdir)/config/cris/umulsidi3.S

 CRTSTUFF_T_CFLAGS = -moverride-best-lib-options
diff --git a/libgcc/config/cris/umulsidi3.S b/libgcc/config/cris/umulsidi3.S
new file mode 100644
index 0000000..bf9858d
--- /dev/null
+++ b/libgcc/config/cris/umulsidi3.S
@@ -0,0 +1,289 @@
+;; Copyright (C) 2001, 2004, 2013 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify it under
+;; the terms of the GNU General Public License as published by the Free
+;; Software Foundation; either version 3, or (at your option) any later
+;; version.
+;;
+;; GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+;; WARRANTY; without even the implied warranty of MERCHANTABILITY or
+;; FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+;; for more details.
+;;
+;; Under Section 7 of GPL version 3, you are granted additional
+;; permissions described in the GCC Runtime Library Exception, version
+;; 3.1, as published by the Free Software Foundation.
+;;
+;; You should have received a copy of the GNU General Public License and
+;; a copy of the GCC Runtime Library Exception along with this program;
+;; see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+;; .
+;;
+;; This code is derived from mulsi3.S, observing that the mstep*16-based
+;; multiplications there, from which it is formed, are actually
+;; zero-extending; in gcc-speak "umulhisi3".  The difference to *this*
+;; function is just a missing top mstep*16 sequence and shifts and 64-bit
+;; additions for the high part.  Compared to an implementation based on
+;; calling __Mul four times (see default implementation of umul_ppmm in
+;; longlong.h), this will complete in a time between a fourth and a third
+;; of that, assuming the value-based optimizations don't strike.  If they
+;; all strike there (very often) but none here, we still win, though by a
+;; lesser margin, due to lesser total overhead.
+
+#define L(x) .x
+#define CONCAT1(a, b) CONCAT2(a, b)
+#define CONCAT2(a, b) a ## b
+
+#ifdef __USER_LABEL_PREFIX__
+# define SYM(x) CONCAT1 (__USER_LABEL_PREFIX__, x)
+#else
+# define SYM(x) x
+#endif
+
+	.global SYM(__umulsidi3)
+	.type SYM(__umulsidi3),@function
+SYM(__umulsidi3):
+#if defined (__CRIS_arch_version) && __CRIS_arch_version >= 10
+;; Can't have the mulu.d last on a cache-line, due to a hardware bug.  See
+;; the documentation for -mmul-bug-workaround.
+;; Not worthwhile to conditionalize here.
+	.p2alignw 2,0x050f
+	mulu.d $r11,$r10
+	ret
+	move $mof,$r11
+#else
+	move.d $r11,$r9
+	bound.d $r10,$r9
+	cmpu.w 65535,$r9
+	bls L(L3)
+	move.d $r10,$r12
+
+	move.d $r10,$r13
+	movu.w $r11,$r9 ; ab*cd = (a*c)<<32 (a*d + b*c)<<16 + b*d
+
+;; We're called for floating point numbers very often with the "low" 16
+;; bits zero, so it's worthwhile to optimize for that.
+
+	beq L(L6) ; d == 0?
+	lslq 16,$r13
+
+	beq L(L7) ; b == 0?
+	clear.w $r10
+
+	mstep $r9,$r13 ; d*b
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+	mstep $r9,$r13
+
+L(L7):
+	test.d $r10
+	mstep $r9,$r10 ; d*a
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+	mstep $r9,$r10
+
+;; d*a in $r10, d*b in $r13, ab in $r12 and cd in $r11
+;; $r9 = d, need to do b*c and a*c; we can drop d.
+;; so $r9 is up for use and we can shift down $r11 as the mstep
+;; source for the next mstep-part.
+
+L(L8):
+	lsrq 16,$r11
+	move.d $r12,$r9
+	lslq 16,$r9
+	beq L(L9) ; b == 0?
+	mstep $r11,$r9
+
+	mstep $r11,$r9 ; b*c
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+	mstep $r11,$r9
+L(L9):
+
+;; d*a in $r10, d*b in $r13, c*b in $r9, ab in $r12 and c in $r11,
+;; need to do a*c.  We want that to end up in $r11, so we shift up $r11 to
+;; now use as the destination operand.  We'd need a test insn to update N
+;; to do it the other way round.
+
+	lsrq 16,$r12
+	lslq 16,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+	mstep $r12,$r11
+
+;; d*a in $r10, d*b in $r13, c*b in $r9, a*c in $r11 ($r12 free).
+;; Need (a*d + b*c)<<16 + b*d into $r10 and
+;; a*c + (a*d + b*c)>>16 plus carry from the additions into $r11.
+
+	add.d $r9,$r10 ; (a*d + b*c) - may produce a carry.
+	scs $r12 ; The carry corresponds to bit 16 of $r11.
+	lslq 16,$r12
+	add.d $r12,$r11 ; $r11 = a*c + carry from (a*d + b*c).
+
+#if defined (__CRIS_arch_version) && __CRIS_arch_version >= 8
+	swapw $r10
+	addu.w $r10,$r11 ; $r11 = a*c + (a*d + b*c) >> 16 including carry.
+	clear.w $r10 ; $r10 = (a*d + b*c) << 16
+#else
+	move.d $r10,$r9
+	lsrq 16,$r9
+	add.d $r9,$r11 ; $r11 = a*c + (a*d + b*c) >> 16 including carry.
+	lslq 16,$r10 ; $r10 = (a*d + b*c) << 16
+#endif
+	add.d $r13,$r10 ; $r10 = (a*d + b*c) << 16 + b*d - may produce a carry.
+	scs $r9
+	ret
+	add.d $r9,$r11 ; Last carry added to the high-order 32 bits.
+
+L(L6):
+	clear.d $r13
+	ba L(L8)
+	clear.d $r10
+
+L(L11):
+	clear.d $r10
+	ret
+	clear.d $r11
+
+L(L3):
+;; Form the maximum in $r10, by knowing the minimum, $r9.
+;; (We don't know which one of $r10 or $r11 it is.)
+;; Check if the largest operand is still just 16 bits.
+
+	xor $r9,$r10
+	xor $r11,$r10
+	cmpu.w 65535,$r10
+	bls L(L5)
+	movu.w $r9,$r13
+
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but c==0
+;; so we only need (a*d)<<16 + b*d with d = $r13, ab = $r10.
+;; Remember that the upper part of (a*d)<<16 goes into the lower part
+;; of $r11 and there may be a carry from adding the low 32 parts.
+	beq L(L11) ; d == 0?
+	move.d $r10,$r9
+
+	lslq 16,$r9
+	beq L(L10) ; b == 0?
+	clear.w $r10
+
+	mstep $r13,$r9 ; b*d
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+	mstep $r13,$r9
+L(L10):
+	test.d $r10
+	mstep $r13,$r10 ; a*d
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	move.d $r10,$r11
+	lsrq 16,$r11
+	lslq 16,$r10
+	add.d $r9,$r10
+	scs $r12
+	ret
+	add.d $r12,$r11
+
+L(L5):
+;; We have ab*cd = (a*c)<<32 + (a*d + b*c)<<16 + b*d, but a and c==0
+;; so b*d (with min=b=$r13, max=d=$r10) it is.  As it won't overflow the
+;; 32-bit part, just set $r11 to 0.
+
+	lslq 16,$r10
+	clear.d $r11
+
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	mstep $r13,$r10
+	ret
+	mstep $r13,$r10
+#endif
+L(Lfe1):
+	.size SYM(__umulsidi3),L(Lfe1)-SYM(__umulsidi3)
diff --git a/libgcc/longlong.h b/libgcc/longlong.h
index 30cc2e3..24dbae4 100644
--- a/libgcc/longlong.h
+++ b/libgcc/longlong.h
@@ -272,12 +272,39 @@ UDItype __umulsidi3 (USItype, USItype);

 #endif /* defined (__AVR__) */

-#if defined (__CRIS__) && __CRIS_arch_version >= 3
+#if defined (__CRIS__)
+
+#if __CRIS_arch_version >= 3
 #define count_leading_zeros(COUNT, X) ((COUNT) = __builtin_clz (X))
+#define COUNT_LEADING_ZEROS_0 32
+#endif /* __CRIS_arch_version >= 3 */
+
 #if __CRIS_arch_version >= 8
 #define count_trailing_zeros(COUNT, X) ((COUNT) = __builtin_ctz (X))
-#endif
-#endif /* __CRIS__ */
+#endif /* __CRIS_arch_version >= 8 */
+
+#if __CRIS_arch_version >= 10
+#define __umulsidi3(u,v) ((UDItype)(USItype) (u) * (UDItype)(USItype) (v))
+#else
+#define __umulsidi3 __umulsidi3
+extern UDItype __umulsidi3 (USItype, USItype);
+#endif /* __CRIS_arch_version >= 10 */
+
+#define umul_ppmm(w1, w0, u, v) \
+  do { \
+    UDItype __x = __umulsidi3 (u, v); \
+    (w0) = (USItype) (__x); \
+    (w1) = (USItype) (__x >> 32); \
+  } while (0)
+
+/* FIXME: defining add_ssaaaa and sub_ddmmss should be advantageous for
+   DFmode ("double" intrinsics, avoiding two of the three insns handling
+   carry), but defining them as open-code C composing and doing the
+   operation in DImode (UDImode) shows that the DImode needs work:
+   register pressure from requiring neighboring registers and the
+   traffic to and from them come to dominate, in the 4.7 series.  */
+
+#endif /* defined (__CRIS__) */

 #if defined (__hppa) && W_TYPE_SIZE == 32
 #define add_ssaaaa(sh, sl, ah, al, bh, bl) \
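
As a rough sketch of how a umul_ppmm of the shape defined in the
longlong.h hunk is consumed: the macro takes two 32-bit operands and
hands back the product as a high word and a low word.  The demo_*
names below are hypothetical and exist only for this illustration; the
stand-in multiply simply uses the compiler's widening multiplication
rather than the libgcc function, and demo_scale_q32 is an assumed
example caller, not something taken from soft-fp.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for __umulsidi3 (assumption: a plain widening multiply).  */
#define demo_umulsidi3(u, v) \
  ((uint64_t) (uint32_t) (u) * (uint64_t) (uint32_t) (v))

/* Same shape as the CRIS umul_ppmm in the patch:
   w1 receives the high 32 bits, w0 the low 32 bits.  */
#define demo_umul_ppmm(w1, w0, u, v)		\
  do {						\
    uint64_t x_ = demo_umulsidi3 (u, v);	\
    (w0) = (uint32_t) x_;			\
    (w1) = (uint32_t) (x_ >> 32);		\
  } while (0)

/* Hypothetical caller: scale x by a 0.32 fixed-point fraction,
   i.e. floor (x * frac / 2^32), which is just the high word.  */
uint32_t
demo_scale_q32 (uint32_t x, uint32_t frac)
{
  uint32_t hi, lo;

  demo_umul_ppmm (hi, lo, x, frac);
  (void) lo;	/* The low word is discarded by the truncation.  */
  return hi;
}
```

With frac = 0x80000000 (one half), demo_scale_q32 (1000, frac) yields
500; a multi-word fraction multiply composes several such high/low
pairs in the same way.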