Message ID | CAGWvny=Xe8TXx1Z6QUABj85-UAjGrRDfUYJBdqQUR2Txp9gg3g@mail.gmail.com
---|---
State | New
Series | Inline IBM long double __gcc_qsub
On Aug 25 2021, David Edelsohn via Gcc-patches wrote:

> rs6000: inline ldouble __gcc_qsub
>
> While performing some tests of IEEE 128 float for PPC64LE, Michael
> Meissner noticed that __gcc_qsub is substantially slower than
> __gcc_qadd.  __gcc_qsub valls __gcc_add with the second operand

__gcc_qadd

Andreas.
On Aug 25 2021, David Edelsohn via Gcc-patches wrote:

> rs6000: inline ldouble __gcc_qsub
>
> While performing some tests of IEEE 128 float for PPC64LE, Michael
> Meissner noticed that __gcc_qsub is substantially slower than
> __gcc_qadd.  __gcc_qsub valls __gcc_add with the second operand
> negated.  Because the functions normally are invoked through
> libgcc shared object, the extra PLT overhead has a large impact
> on the overall time of the function.  Instead of trying to be
> fancy with function decorations to prevent interposition, this
> patch inlines the definition of __gcc_qadd into __gcc_qsub with
> the negation propagated through the function.
>
> libgcc/ChangeLog:
>
>         * config/rs6000/ibm-ldouble.c (__gcc_qsub): Inline negated
>         __gcc_qadd.

How about defining a static function that is used by both?

Andreas.
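For readers following along, a minimal standalone sketch of the shared-static-helper approach Andreas suggests. The helper body here is a simplified two-sum based double-double add with no non-finite handling, not the actual libgcc algorithm, and the names dd_t, dd_add_internal, my_qadd and my_qsub are illustrative only:

```c
#include <stdio.h>

/* Double-double value: hi holds the leading double, lo the tail.  */
typedef struct { double hi, lo; } dd_t;

/* Shared static helper: both entry points below call it directly, so
   the call is resolved locally and never goes through a PLT stub.  */
static dd_t
dd_add_internal (double a, double aa, double c, double cc)
{
  double s = a + c;                     /* rounded sum of the high parts */
  double v = s - a;
  double e = (a - (s - v)) + (c - v);   /* rounding error of a + c (TwoSum) */
  e += aa + cc;                         /* fold in the low parts */
  dd_t r;
  r.hi = s + e;                         /* renormalize into hi/lo */
  r.lo = e - (r.hi - s);
  return r;
}

dd_t
my_qadd (double a, double aa, double c, double cc)
{
  return dd_add_internal (a, aa, c, cc);
}

dd_t
my_qsub (double a, double aa, double c, double cc)
{
  /* Subtraction is addition with both halves of the second operand
     negated; the direct static call avoids the PLT hop that the
     current __gcc_qsub -> __gcc_qadd call pays.  */
  return dd_add_internal (a, aa, -c, -cc);
}

int
main (void)
{
  dd_t d = my_qsub (1.0, 0.0, 1.0 / 3.0, 0.0);
  printf ("%.17g + %.17g\n", d.hi, d.lo);
  return 0;
}
```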
Hi!

On Wed, Aug 25, 2021 at 08:23:32PM -0400, David Edelsohn wrote:
> rs6000: inline ldouble __gcc_qsub
>
> While performing some tests of IEEE 128 float for PPC64LE, Michael
> Meissner noticed that __gcc_qsub is substantially slower than
> __gcc_qadd.  __gcc_qsub valls __gcc_add with the second operand

("calls", "__gcc_qadd")

> negated.  Because the functions normally are invoked through
> libgcc shared object, the extra PLT overhead has a large impact
> on the overall time of the function.  Instead of trying to be
> fancy with function decorations to prevent interposition, this
> patch inlines the definition of __gcc_qadd into __gcc_qsub with
> the negation propagated through the function.

Looks good to me, and it is a good way to resolve this.

This code is too old (and unimportant) to do serious engineering on.
If we want any serious optimisation on it we should do that at tree
level (why does that not happen yet anyway?), and inline all of this.
This patch is really just to make benchmark results saner ;-)

Thanks David!


Segher
diff --git a/libgcc/config/rs6000/ibm-ldouble.c b/libgcc/config/rs6000/ibm-ldouble.c
index 4c13453f975..ed74900e5c3 100644
--- a/libgcc/config/rs6000/ibm-ldouble.c
+++ b/libgcc/config/rs6000/ibm-ldouble.c
@@ -158,9 +158,42 @@ __gcc_qadd (double a, double aa, double c, double cc)
 }
 
 IBM128_TYPE
-__gcc_qsub (double a, double b, double c, double d)
+__gcc_qsub (double a, double aa, double c, double cc)
 {
-  return __gcc_qadd (a, b, -c, -d);
+  double xh, xl, z, q, zz;
+
+  z = a - c;
+
+  if (nonfinite (z))
+    {
+      if (fabs (z) != inf ())
+	return z;
+      z = -cc + aa - c + a;
+      if (nonfinite (z))
+	return z;
+      xh = z;	/* Will always be DBL_MAX.  */
+      zz = aa - cc;
+      if (fabs (a) > fabs (c))
+	xl = a - z - c + zz;
+      else
+	xl = -c - z + a + zz;
+    }
+  else
+    {
+      q = a - z;
+      zz = q - c + (a - (q + z)) + aa - cc;
+
+      /* Keep -0 result.  */
+      if (zz == 0.0)
+	return z;
+
+      xh = z + zz;
+      if (nonfinite (xh))
+	return xh;
+
+      xl = z - xh + zz;
+    }
+  return pack_ldouble (xh, xl);
 }
 
 #ifdef __NO_FPRS__
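For context, a small illustrative test program (not part of the patch): on powerpc64 targets that use the IBM double-double long double ABI, GCC lowers a plain long double subtraction to a call to __gcc_qsub in libgcc, which is why the per-call PLT overhead of the old one-line wrapper showed up in the IEEE 128 comparison benchmarks:

```c
#include <stdio.h>

int
main (void)
{
  /* volatile keeps the compiler from folding the arithmetic away.  */
  volatile long double x = 1.0L / 3.0L;
  volatile long double y = 1.0L / 7.0L;
  long double d = x - y;   /* lowered to a __gcc_qsub call on IBM ldouble */
  printf ("%.30Lg\n", d);
  return 0;
}
```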