From patchwork Fri May  8 15:09:39 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
X-Patchwork-Id: 470077
Return-Path: 
 <gcc-patches-return-397245-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 2B62F140281
	for <incoming@patchwork.ozlabs.org>;
	Sat,  9 May 2015 01:09:57 +1000 (AEST)
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b=CEncfaLb; dkim-atps=neutral
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:message-id:date:from:mime-version:to:cc:subject:references
	:in-reply-to:content-type; q=dns; s=default; b=sA9CHOf2in4egPHIu
	sifHuvha1QZ81xlE9m9VWAEXOkl2tQ5qZv5KDsIuA7q3Urtsv4ohnl9W0aBVFg6H
	uyM8jvOq3yWQ1FZM+V5VU49x6A4dSzdMOr7enj6azgpqQJuvLyqe05j1F8P7dllc
	A2Hpa7rMDotIFz4qd+RmaISMEc=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:message-id:date:from:mime-version:to:cc:subject:references
	:in-reply-to:content-type; s=default; bh=89wolLxKkwN0EAVDqTgVf5/
	04m0=; b=CEncfaLbUaqdSgurK1va0g8IqtPBjkb+WKcwx0aR16NolH3xq0nVN78
	qxPtVJGWMr9YlarAMzj1qcckJrgjjKYms634kNjmWAv6qH6e6Z12yU+2fURibk4F
	Hd/EW19Hi/uhNV6y9vdQb1Sf1oeshGq1b+jsWC8brmQKshEoqy7U=
Received: (qmail 36746 invoked by alias); 8 May 2015 15:09:49 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 36735 invoked by uid 89); 8 May 2015 15:09:49 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.4 required=5.0 tests=AWL, BAYES_05,
	KAM_LAZY_DOMAIN_SECURITY,
	T_RP_MATCHES_RCVD autolearn=no version=3.3.2
X-HELO: foss.arm.com
Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by
	sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Fri, 08 May 2015 15:09:43 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249])	by
	usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id E2C5629;
	Fri,  8 May 2015 08:09:08 -0700 (PDT)
Received: from [10.2.207.50] (e100706-lin.cambridge.arm.com
	[10.2.207.50])	by usa-sjc-imap-foss1.foss.arm.com (Postfix)
	with ESMTPSA id 8778B3F251; Fri,  8 May 2015 08:09:41 -0700 (PDT)
Message-ID: <554CD1B3.1060300@foss.arm.com>
Date: Fri, 08 May 2015 16:09:39 +0100
From: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>,
	Richard Biener <richard.guenther@gmail.com>
CC: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH][tree-ssa-math-opts] Expand pow (x,
	CONST) using square roots when possible
References: <5543A3AB.1090309@foss.arm.com>
	<CAFiYyc3uz9-m0g5FB2YPKy5Vef-mgS7PJ5qUtsvw8=0VwzenZQ@mail.gmail.com>
	<554CC0A6.8020308@foss.arm.com>
In-Reply-To: <554CC0A6.8020308@foss.arm.com>

On 08/05/15 14:56, Kyrill Tkachov wrote:
> On 08/05/15 11:18, Richard Biener wrote:
>> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
>>> (int)k + 0.5)
>>> using square roots. So, for the above examples it would generate sqrt (x) *
>>> sqrt (sqrt (x)),
>>> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
>>> will calculate the
>>> reciprocal of that).
>>>
>>> However, the implementation of these optimisations is done on a bit of an
>>> ad-hoc basis with
>>> the 0.25, 0.5, 0.75 cases hardcoded.
>>> Judging by
>>> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
>>> these are the most commonly used exponents (at least in SPEC ;))
>>>
>>> This patch generalises this optimisation into a (hopefully) more robust
>>> algorithm.
>>> In particular, it expands calls to pow (x, CST) by expanding the integer
>>> part of CST
>>> using a powi, like it does already, and then expanding the fractional part
>>> as a product
>>> of repeated applications of a square root if the fractional part can be
>>> expressed
>>> as a multiple of a power of 0.5.
>>>
>>> I try to explain the algorithm in more detail in the comments in the patch
>>> but, for example:
>>>
>>> pow (x, 5.625) is not currently handled, but with this patch will be
>>> expanded
>>> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
>>> 0.5 + 0.5**3
>>>
>>> Negative exponents are handled in either of two ways, depending on the
>>> exponent value:
>>> * Using a simple reciprocal.
>>>     For example:
>>>     pow (x, -5.625) == 1.0 / pow (x, 5.625)
>>>       --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>>>
>>> * For pow (x, EXP) with negative exponent EXP with integer part INT and
>>> fractional part FRAC:
>>> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>>>     For example:
>>>     pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>>>       --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>>>
>>>
>>> Since hardware square root instructions tend to be expensive, we may want to
>>> reduce the number
>>> of square roots we are willing to calculate. Since we reuse intermediate
>>> square root results,
>>> this boils down to restricting the depth of the square root chains. In all
>>> the examples above
>>> that depth is 3. I've made this maximum depth parametrisable in params.def.
>>> By adjusting that
>>> parameter we can adjust the resolution of this optimisation. So, if it's set
>>> to '4' then we
>>> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
>>> including negative
>>> multiples. Currently, GCC will not try to expand negative multiples of
>>> anything else than 0.5
>>>
>>> I have tried to keep the existing functionality intact and activate this
>>> only for
>>> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>>>    An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
>>> even
>>> when a hardware sqrt is not available, presumably because the library
>>> function for
>>> sqrt is usually faster than pow (?).
>> Yes.  It's also a safe transform - which you seem to put under
>> flag_unsafe_math_optimizations only with your patch.
>>
>> It would be clearer to just leave the special-case
>>
>> -  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
>> -     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
>> -     sqrt(-0) = -0.  */
>> -  if (sqrtfn
>> -      && REAL_VALUES_EQUAL (c, dconsthalf)
>> -      && !HONOR_SIGNED_ZEROS (mode))
>> -    return build_and_insert_call (gsi, loc, sqrtfn, arg0);
>>
>> in as-is.
> Ok, I'll leave that case explicit.
>
>> You also removed the Os constraint which you should put back in.
>> Basically if !optimize_function_for_speed_p then generate at most
>> two calls to sqrt (iff the HW has a sqrt instruction).
> I tried to move that logic into expand_with_sqrts but
> I'll move it outside it. It seems that this boils down to
> only 0.25, as any other 2xsqrt chain will also involve a
> multiply or a divide which we currently avoid.
>
>> You fail to add a testcase that checks that the optimization applies.
> I'll add one to scan the sincos dump.
> I notice that we don't have a testuite check that the target has
> a hw sqrt instructions. Would you like me to add one? Or can I make
> the testcase aarch64-specific?
>
>> Otherwise the idea looks good though there must be a better way
>> to compute the series than by using real-arithmetic and forcefully
>> trying out all possibilities...
> I get that feeling too. What I need is not only a way
> of figuring out if the fractional part of the exponent can be
> represented in this way, but also compute the depth of the
> sqrt chain and the number of multiplies...
> That being said, the current approach is O(maximum depth) and
> I don't expect the depth to go much beyond 3 or 4 in practice.
>
> Thanks for looking at it!
> I'll respin the patch.

And here it is, with my above comments implemented.
Bootstrapped on x86_64 and tested on aarch64.
Full testing on arm and aarch64 ongoing.

Is this ok if testing comes clean?

Thanks,
Kyrill

> Kyrill
>
>> Richard.
>>
>>> Having seen the glibc implementation of a fully IEEE-754-compliant pow
>>> function, I think we
>>> would prefer synthesising the pow call whenever we can for -ffast-math.
>>>
>>> I have seen this optimisation trigger a few times in SPEC2k6, in particular
>>> in 447.dealII
>>> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
>>> pow (x, 0.875)
>>> with square roots, multiplies and, in the case of -0.25, divides.
>>> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>>>
>>> On 481.wrf on aarch64 I saw about a 1% improvement.
>>> The cycle count on x86_64 was also smaller, but not by a significant amount
>>> (the same calls to
>>> pow were eliminated).
>>>
>>> In general, I think this can shine if multiple expandable calls to pow
>>> appear together.
>>> So, for example for code:
>>> double
>>> baz (double a)
>>> {
>>>     return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
>>> (a, 3.375);
>>> }
>>>
>>> we can generate:
>>> baz:
>>>           fsqrt   d3, d0
>>>           fmul    d4, d0, d0
>>>           fmov    d5, 1.0e+0
>>>           fmul    d6, d0, d4
>>>           fsqrt   d2, d3
>>>           fmul    d1, d0, d2
>>>           fsqrt   d0, d2
>>>           fmul    d3, d3, d2
>>>           fdiv    d1, d5, d1
>>>           fmul    d3, d3, d6
>>>           fmul    d2, d2, d0
>>>           fmadd   d0, d4, d3, d1
>>>           fmsub   d0, d6, d2, d0
>>>           ret
>>>
>>> reusing the sqrt results and doing more optimisations rather than the
>>> current:
>>> baz:
>>>           stp     x29, x30, [sp, -48]!
>>>           fmov    d1, -1.25e+0
>>>           add     x29, sp, 0
>>>           stp     d8, d9, [sp, 16]
>>>           fmov    d9, d0
>>>           str     d10, [sp, 32]
>>>           bl      pow
>>>           fmov    d8, d0
>>>           fmov    d0, d9
>>>           fmov    d1, 5.75e+0
>>>           bl      pow
>>>           fmov    d10, d0
>>>           fmov    d0, d9
>>>           fmov    d1, 3.375e+0
>>>           bl      pow
>>>           fadd    d8, d8, d10
>>>           ldr     d10, [sp, 32]
>>>           fsub    d0, d8, d0
>>>           ldp     d8, d9, [sp, 16]
>>>           ldp     x29, x30, [sp], 48
>>>           ret
>>>
>>>
>>> Of course gcc could already do that if the exponents were to fall in the set
>>> {0.25, 0.75, k+0.5}
>>> but with this patch that set can be greatly expanded.
>>>
>>> I suppose if we're really lucky we might even open up new vectorisation
>>> opportunities.
>>> For example:
>>> void
>>> vecfoo (float *a, float *b)
>>> {
>>>     for (int i = 0; i < N; i++)
>>>       a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
>>> }
>>>
>>> will now be vectorisable if we have a hardware vector sqrt instruction.
>>> Though I'm not sure
>>> how often this would appear in real code.
>>>
>>>
>>> I would like advice on some implementation details:
>>> - The new function representable_as_half_series_p is used to check if the
>>> fractional
>>> part of an exponent can be represented as a multiple of a power of 0.5. It
>>> does so
>>> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a
>>> smarter
>>> way of doing this, considering that REAL_VALUE_TYPE holds the bit
>>> representation of the
>>> floating point number?
>>>
>>> - Are there any correctness cases that I may have missed? This patch gates
>>> the optimisation
>>> on -funsafe-math-optimizations, but maybe there are some edge cases that I
>>> missed?
>>>
>>> - What should be the default value of the max-pow-sqrt-depth param? In this
>>> patch it's 5 which
>>> on second thought strikes me as a bit aggressive. To catch exponents that
>>> are multiples of
>>> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
>>> suppose it depends on
>>> how likely more fine-grained powers are to appear in real code, how
>>> expensive the C library
>>> implementation of pow is, and how expensive are the sqrt instructions in
>>> hardware.
>>>
>>>
>>> Bootstrapped and tested on x86_64, aarch64, arm (pending)
>>> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that
>>> might
>>> be of interest to look at?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>       * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>>>       * tree-ssa-math-opts.c: Include params.h
>>>       (pow_synth_sqrt_info): New struct.
>>>       (representable_as_half_series_p): New function.
>>>       (get_fn_chain): Likewise.
>>>       (print_nested_fn): Likewise.
>>>       (dump_fractional_sqrt_sequence): Likewise.
>>>       (dump_integer_part): Likewise.
>>>       (expand_pow_as_sqrts): Likewise.
>>>       (gimple_expand_builtin_pow): Use above to attempt to expand
>>>       pow as series of square roots.  Removed now unused variables.
>>>
>>> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>       * gcc.dg/pow-sqrt.x: New file.
>>>       * gcc.dg/pow-sqrt-1.c: New test.
>>>       * gcc.dg/pow-sqrt-2.c: Likewise.
>>>       * gcc.dg/pow-sqrt-3.c: Likewise.

commit 1c55f44e3294b17ba113032c2d05bb9e09c8fc69
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Wed Apr 29 13:27:07 2015 +0100

    [tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

diff --git a/gcc/params.def b/gcc/params.def
index 48b39a2..3e4ba3a 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -262,6 +262,14 @@ DEFPARAM(PARAM_MAX_HOIST_DEPTH,
 	 "Maximum depth of search in the dominator tree for expressions to hoist",
 	 30, 0, 0)
 
+
+/* When synthesizing expnonentiation by a real constant operations using square
+   roots, this controls how deep sqrt chains we are willing to generate.  */
+DEFPARAM(PARAM_MAX_POW_SQRT_DEPTH,
+	 "max-pow-sqrt-depth",
+	 "Maximum depth of sqrt chains to use when synthesizing exponentiation by a real constant",
+	 5, 1, 32)
+
 /* This parameter limits the number of insns in a loop that will be unrolled,
    and by how much the loop is unrolled.
 
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-1.c
new file mode 100644
index 0000000..0793b6f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-1.c
@@ -0,0 +1,6 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=5" } */
+
+#define EXPN (-6 * (0.5*0.5*0.5*0.5))
+
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-2.c b/gcc/testsuite/gcc.dg/pow-sqrt-2.c
new file mode 100644
index 0000000..b2fada4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-2.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=5" } */
+
+#define EXPN (-5.875)
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-3.c b/gcc/testsuite/gcc.dg/pow-sqrt-3.c
new file mode 100644
index 0000000..18c7231
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-3.c
@@ -0,0 +1,5 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ffast-math --param max-pow-sqrt-depth=3" } */
+
+#define EXPN (1.25)
+#include "pow-sqrt.x"
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt.x b/gcc/testsuite/gcc.dg/pow-sqrt.x
new file mode 100644
index 0000000..bd744c6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt.x
@@ -0,0 +1,30 @@
+
+extern void abort (void);
+
+
+__attribute__((noinline)) double
+real_pow (double x, double pow_exp)
+{
+  return __builtin_pow (x, pow_exp);
+}
+
+#define EPS (0.000000000000000000001)
+
+#define SYNTH_POW(X, Y) __builtin_pow (X, Y)
+volatile double arg;
+
+int
+main (void)
+{
+  double i_arg = 0.1;
+
+  for (arg = i_arg; arg < 100.0; arg += 1.0)
+    {
+      double synth_res = SYNTH_POW (arg, EXPN);
+      double real_res = real_pow (arg, EXPN);
+
+      if (__builtin_abs (SYNTH_POW (arg, EXPN) - real_pow (arg, EXPN)) > EPS)
+	abort ();
+    }
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
new file mode 100644
index 0000000..52514fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
+
+
+double
+foo (double a)
+{
+  return __builtin_pow (a, -5.875);
+}
+
+double
+foof (double a)
+{
+  return __builtin_pow (a, 0.75f);
+}
+
+double
+bar (double a)
+{
+  return __builtin_pow (a, 1.0 + 0.00390625);
+}
+
+double
+baz (double a)
+{
+  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
+}
+
+#define N 256
+void
+vecfoo (double *a)
+{
+  for (int i = 0; i < N; i++)
+    a[i] = __builtin_pow (a[i], 1.25);
+}
+
+/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
+/* { dg-final { cleanup-tree-dump "sincos" } } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index c22a677..11288b1 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -143,6 +143,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "gimple-pretty-print.h"
 #include "builtins.h"
+#include "params.h"
 
 /* FIXME: RTL headers have to be included here for optabs.  */
 #include "rtl.h"		/* Because optabs.h wants enum rtx_code.  */
@@ -1148,6 +1149,357 @@ build_and_insert_cast (gimple_stmt_iterator *gsi, location_t loc,
   return result;
 }
 
+struct pow_synth_sqrt_info
+{
+  bool *factors;
+  unsigned int deepest;
+  unsigned int num_mults;
+};
+
+/* Return true iff the real value C can be represented as a
+   sum of powers of 0.5 up to N.  That is:
+   C == SUM<i from 1..N> (a[i]*(0.5**i)) where a[i] is either 0 or 1.
+   Record in INFO the various parameters of the synthesis algorithm such
+   as the factors a[i], the maximum 0.5 power and the number of
+   multiplications that will be required.  */
+
+bool
+representable_as_half_series_p (REAL_VALUE_TYPE c, unsigned n,
+				 struct pow_synth_sqrt_info *info)
+{
+  REAL_VALUE_TYPE factor = dconsthalf;
+  REAL_VALUE_TYPE remainder = c;
+
+  info->deepest = 0;
+  info->num_mults = 0;
+  memset (info->factors, 0, n * sizeof (bool));
+
+  for (unsigned i = 0; i < n; i++)
+    {
+      REAL_VALUE_TYPE res;
+
+      /* If something inexact happened bail out now.  */
+      if (REAL_ARITHMETIC (res, MINUS_EXPR, remainder, factor))
+	return false;
+
+      /* We have hit zero.  The number is representable as a sum
+         of powers of 0.5.  */
+      if (REAL_VALUES_EQUAL (res, dconst0))
+	{
+	  info->factors[i] = true;
+	  info->deepest = i + 1;
+	  return true;
+	}
+      else if (!REAL_VALUE_NEGATIVE (res))
+	{
+	  remainder = res;
+	  info->factors[i] = true;
+	  info->num_mults++;
+	}
+      else
+	info->factors[i] = false;
+
+      REAL_ARITHMETIC (factor, MULT_EXPR, factor, dconsthalf);
+    }
+  return false;
+}
+
+/* Return the tree corresponding to FN being applied
+   to ARG N times at GSI and LOC.
+   Look up previous results from CACHE if need be.
+   cache[0] should contain just plain ARG i.e. FN applied to ARG 0 times.  */
+
+static tree
+get_fn_chain (tree arg, unsigned int n, gimple_stmt_iterator *gsi,
+	      tree fn, location_t loc, tree *cache)
+{
+  tree res = cache[n];
+  if (!res)
+    {
+      tree prev = get_fn_chain (arg, n - 1, gsi, fn, loc, cache);
+      res = build_and_insert_call (gsi, loc, fn, prev);
+      cache[n] = res;
+    }
+
+  return res;
+}
+
+/* Print to STREAM the repeated application of function FNAME to ARG
+   N times.  So, for FNAME = "foo", ARG = "x", N = 2 it would print:
+   "foo (foo (x))".  */
+
+static void
+print_nested_fn (FILE* stream, const char *fname, const char* arg,
+		 unsigned int n)
+{
+  if (n == 0)
+    fprintf (stream, "%s", arg);
+  else
+    {
+      fprintf (stream, "%s (", fname);
+      print_nested_fn (stream, fname, arg, n - 1);
+      fprintf (stream, ")");
+    }
+}
+
+/* Print to STREAM the fractional sequence of sqrt chains
+   applied to ARG, described by INFO.  Used for the dump file.  */
+
+static void
+dump_fractional_sqrt_sequence (FILE *stream, const char *arg,
+			        struct pow_synth_sqrt_info *info)
+{
+  for (unsigned int i = 0; i < info->deepest; i++)
+    {
+      bool is_set = info->factors[i];
+      if (is_set)
+	{
+	  print_nested_fn (stream, "sqrt", arg, i + 1);
+	  if (i != info->deepest - 1)
+	    fprintf (stream, " * ");
+	}
+    }
+}
+
+/* Print to STREAM a representation of raising ARG to an integer
+   power N.  Used for the dump file.  */
+
+static void
+dump_integer_part (FILE *stream, const char* arg, HOST_WIDE_INT n)
+{
+  if (n > 1)
+    fprintf (stream, "powi (%s, " HOST_WIDE_INT_PRINT_DEC ")", arg, n);
+  else if (n == 1)
+    fprintf (stream, "%s", arg);
+}
+
+/* Attempt to synthesize a POW[F] (ARG0, ARG1) call using chains of
+   square roots.  Place at GSI and LOC.  Limit the maximum depth
+   of the sqrt chains to MAX_DEPTH.  Return the tree holding the
+   result of the expanded sequence or NULL_TREE if the expansion failed.
+
+   This routine assumes that ARG1 is a real number with a fractional part
+   (the integer exponent case will have been handled earlier in
+   gimple_expand_builtin_pow).
+
+   For ARG1 > 0.0:
+   * For ARG1 composed of a whole part WHOLE_PART and a fractional part
+     FRAC_PART i.e. WHOLE_PART == floor (ARG1) and
+                    FRAC_PART == ARG1 - WHOLE_PART:
+     Produce POWI (ARG0, WHOLE_PART) * POW (ARG0, FRAC_PART) where
+     POW (ARG0, FRAC_PART) is expanded as a product of square root chains
+     if it can be expressed as such, that is if FRAC_PART satisfies:
+     FRAC_PART == <SUM from i = 1 until MAX_DEPTH> (a[i] * (0.5**i))
+     where integer a[i] is either 0 or 1.
+
+     Example:
+     POW (x, 3.625) == POWI (x, 3) * POW (x, 0.625)
+       --> POWI (x, 3) * SQRT (x) * SQRT (SQRT (SQRT (x)))
+
+   For ARG1 < 0.0 there are two approaches:
+   * (A) Expand to 1.0 / POW (ARG0, -ARG1) where POW (ARG0, -ARG1)
+         is calculated as above.
+
+     Example:
+     POW (x, -5.625) == 1.0 / POW (x, 5.625)
+       -->  1.0 / (POWI (x, 5) * SQRT (x) * SQRT (SQRT (SQRT (x))))
+
+   * (B) : WHOLE_PART := - ceil (abs (ARG1))
+           FRAC_PART  := ARG1 - WHOLE_PART
+     and expand to POW (x, FRAC_PART) / POWI (x, WHOLE_PART).
+     Example:
+     POW (x, -5.875) == POW (x, 0.125) / POWI (X, 6)
+       --> SQRT (SQRT (SQRT (x))) / (POWI (x, 6))
+
+   For ARG1 < 0.0 we choose between (A) and (B) depending on
+   how many multiplications we'd have to do.
+   So, for the example in (B): POW (x, -5.875), if we were to
+   follow algorithm (A) we would produce:
+   1.0 / POWI (X, 5) * SQRT (X) * SQRT (SQRT (X)) * SQRT (SQRT (SQRT (X)))
+   which contains more multiplications than approach (B).
+
+   Hopefully, this approach will eliminate potentially expensive POW library
+   calls when unsafe floating point math is enabled and allow the compiler to
+   further optimise the multiplies, square roots and divides produced by this
+   function.  */
+
+static tree
+expand_pow_as_sqrts (gimple_stmt_iterator *gsi, location_t loc,
+		     tree arg0, tree arg1, HOST_WIDE_INT max_depth)
+{
+  tree type = TREE_TYPE (arg0);
+  machine_mode mode = TYPE_MODE (type);
+  tree sqrtfn = mathfn_built_in (type, BUILT_IN_SQRT);
+  bool one_over = true;
+
+  if (!sqrtfn)
+    return NULL_TREE;
+
+  if (TREE_CODE (arg1) != REAL_CST)
+    return NULL_TREE;
+
+  REAL_VALUE_TYPE exp_init = TREE_REAL_CST (arg1);
+
+  gcc_assert (max_depth > 0);
+  tree *cache = XALLOCAVEC (tree, max_depth + 1);
+
+  struct pow_synth_sqrt_info synth_info;
+  synth_info.factors = XALLOCAVEC (bool, max_depth + 1);
+  synth_info.deepest = 0;
+  synth_info.num_mults = 0;
+
+  bool neg_exp = REAL_VALUE_NEGATIVE (exp_init);
+  REAL_VALUE_TYPE exp = real_value_abs (&exp_init);
+
+  /* The whole and fractional parts of exp.  */
+  REAL_VALUE_TYPE whole_part;
+  REAL_VALUE_TYPE frac_part;
+
+  real_floor (&whole_part, mode, &exp);
+  REAL_ARITHMETIC (frac_part, MINUS_EXPR, exp, whole_part);
+
+
+  REAL_VALUE_TYPE ceil_whole = dconst0;
+  REAL_VALUE_TYPE ceil_fract = dconst0;
+
+  if (neg_exp)
+    {
+      real_ceil (&ceil_whole, mode, &exp);
+      REAL_ARITHMETIC (ceil_fract, MINUS_EXPR, ceil_whole, exp);
+    }
+
+  if (!representable_as_half_series_p (frac_part, max_depth, &synth_info))
+    return NULL_TREE;
+
+  /* Check whether it's more profitable to not use 1.0 / ...  */
+  if (neg_exp)
+    {
+      struct pow_synth_sqrt_info alt_synth_info;
+      alt_synth_info.factors = XALLOCAVEC (bool, max_depth + 1);
+      alt_synth_info.deepest = 0;
+      alt_synth_info.num_mults = 0;
+
+      if (representable_as_half_series_p (ceil_fract, max_depth,
+					   &alt_synth_info)
+	  && alt_synth_info.deepest <= synth_info.deepest
+	  && alt_synth_info.num_mults < synth_info.num_mults)
+	{
+	  whole_part = ceil_whole;
+	  frac_part = ceil_fract;
+	  synth_info.deepest = alt_synth_info.deepest;
+	  synth_info.num_mults = alt_synth_info.num_mults;
+	  memcpy (synth_info.factors, alt_synth_info.factors,
+		  (max_depth + 1) * sizeof (bool));
+	  one_over = false;
+	}
+    }
+
+  HOST_WIDE_INT n = real_to_integer (&whole_part);
+  REAL_VALUE_TYPE cint;
+  real_from_integer (&cint, VOIDmode, n, SIGNED);
+
+  if (!real_identical (&whole_part, &cint))
+    return NULL_TREE;
+
+  if (powi_cost (n) + synth_info.num_mults > POWI_MAX_MULTS)
+    return NULL_TREE;
+
+  memset (cache, 0, (max_depth + 1) * sizeof (tree));
+
+  tree integer_res = n == 0 ? build_real (type, dconst1) : arg0;
+
+  /* Calculate the integer part of the exponent.  */
+  if (n > 1)
+    {
+      integer_res = gimple_expand_builtin_powi (gsi, loc, arg0, n);
+      if (!integer_res)
+	return NULL_TREE;
+    }
+
+  if (dump_file)
+    {
+      char string[64];
+
+      real_to_decimal (string, &exp_init, sizeof (string), 0, 1);
+      fprintf (dump_file, "synthesizing pow (x, %s) as:\n", string);
+
+      if (neg_exp)
+	{
+	  if (one_over)
+	    {
+	      fprintf (dump_file, "1.0 / (");
+	      dump_integer_part (dump_file, "x", n);
+	      if (n > 0)
+	        fprintf (dump_file, " * ");
+	      dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	      fprintf (dump_file, ")");
+	    }
+	  else
+	    {
+	      dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	      fprintf (dump_file, " / (");
+	      dump_integer_part (dump_file, "x", n);
+	      fprintf (dump_file, ")");
+	    }
+	}
+      else
+	{
+	  dump_fractional_sqrt_sequence (dump_file, "x", &synth_info);
+	  if (n > 0)
+	    fprintf (dump_file, " * ");
+	  dump_integer_part (dump_file, "x", n);
+	}
+
+      fprintf (dump_file, "\ndeepest sqrt chain: %d\n", synth_info.deepest);
+    }
+
+
+  tree fract_res = NULL_TREE;
+  cache[0] = arg0;
+
+  /* Calculate the fractional part of the exponent.  */
+  for (unsigned i = 0; i < synth_info.deepest; i++)
+    {
+      if (synth_info.factors[i])
+	{
+	  tree sqrt_chain = get_fn_chain (arg0, i + 1, gsi, sqrtfn, loc, cache);
+
+	  if (!fract_res)
+	      fract_res = sqrt_chain;
+
+	  else
+	    fract_res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+					   fract_res, sqrt_chain);
+	}
+    }
+
+  tree res = NULL_TREE;
+
+  if (neg_exp)
+    {
+      if (one_over)
+	{
+	  if (n > 0)
+	    res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+					   fract_res, integer_res);
+	  else
+	    res = fract_res;
+
+	  res = build_and_insert_binop (gsi, loc, "powrootrecip", RDIV_EXPR,
+					  build_real (type, dconst1), res);
+	}
+      else
+	{
+	  res = build_and_insert_binop (gsi, loc, "powroot", RDIV_EXPR,
+					 fract_res, integer_res);
+	}
+    }
+  else
+    res = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
+				   fract_res, integer_res);
+  return res;
+}
+
 /* ARG0 and ARG1 are the two arguments to a pow builtin call in GSI
    with location info LOC.  If possible, create an equivalent and
    less expensive sequence of statements prior to GSI, and return an
@@ -1157,13 +1509,17 @@ static tree
 gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc, 
 			   tree arg0, tree arg1)
 {
-  REAL_VALUE_TYPE c, cint, dconst1_4, dconst3_4, dconst1_3, dconst1_6;
+  REAL_VALUE_TYPE c, cint, dconst1_3, dconst1_4, dconst1_6;
   REAL_VALUE_TYPE c2, dconst3;
   HOST_WIDE_INT n;
-  tree type, sqrtfn, cbrtfn, sqrt_arg0, sqrt_sqrt, result, cbrt_x, powi_cbrt_x;
+  tree type, sqrtfn, cbrtfn, sqrt_arg0, result, cbrt_x, powi_cbrt_x;
   machine_mode mode;
+  bool speed_p = optimize_bb_for_speed_p (gsi_bb (*gsi));
   bool hw_sqrt_exists, c_is_int, c2_is_int;
 
+  dconst1_4 = dconst1;
+  SET_REAL_EXP (&dconst1_4, REAL_EXP (&dconst1_4) - 2);
+
   /* If the exponent isn't a constant, there's nothing of interest
      to be done.  */
   if (TREE_CODE (arg1) != REAL_CST)
@@ -1179,7 +1535,7 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
   if (c_is_int
       && ((n >= -1 && n <= 2)
 	  || (flag_unsafe_math_optimizations
-	      && optimize_bb_for_speed_p (gsi_bb (*gsi))
+	      && speed_p
 	      && powi_cost (n) <= POWI_MAX_MULTS)))
     return gimple_expand_builtin_powi (gsi, loc, arg0, n);
 
@@ -1196,49 +1552,8 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       && !HONOR_SIGNED_ZEROS (mode))
     return build_and_insert_call (gsi, loc, sqrtfn, arg0);
 
-  /* Optimize pow(x,0.25) = sqrt(sqrt(x)).  Assume on most machines that
-     a builtin sqrt instruction is smaller than a call to pow with 0.25,
-     so do this optimization even if -Os.  Don't do this optimization
-     if we don't have a hardware sqrt insn.  */
-  dconst1_4 = dconst1;
-  SET_REAL_EXP (&dconst1_4, REAL_EXP (&dconst1_4) - 2);
   hw_sqrt_exists = optab_handler (sqrt_optab, mode) != CODE_FOR_nothing;
 
-  if (flag_unsafe_math_optimizations
-      && sqrtfn
-      && REAL_VALUES_EQUAL (c, dconst1_4)
-      && hw_sqrt_exists)
-    {
-      /* sqrt(x)  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-      /* sqrt(sqrt(x))  */
-      return build_and_insert_call (gsi, loc, sqrtfn, sqrt_arg0);
-    }
-      
-  /* Optimize pow(x,0.75) = sqrt(x) * sqrt(sqrt(x)) unless we are
-     optimizing for space.  Don't do this optimization if we don't have
-     a hardware sqrt insn.  */
-  real_from_integer (&dconst3_4, VOIDmode, 3, SIGNED);
-  SET_REAL_EXP (&dconst3_4, REAL_EXP (&dconst3_4) - 2);
-
-  if (flag_unsafe_math_optimizations
-      && sqrtfn
-      && optimize_function_for_speed_p (cfun)
-      && REAL_VALUES_EQUAL (c, dconst3_4)
-      && hw_sqrt_exists)
-    {
-      /* sqrt(x)  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
-
-      /* sqrt(sqrt(x))  */
-      sqrt_sqrt = build_and_insert_call (gsi, loc, sqrtfn, sqrt_arg0);
-
-      /* sqrt(x) * sqrt(sqrt(x))  */
-      return build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
-				     sqrt_arg0, sqrt_sqrt);
-    }
-
   /* Optimize pow(x,1./3.) = cbrt(x).  This requires unsafe math
      optimizations since 1./3. is not exactly representable.  If x
      is negative and finite, the correct value of pow(x,1./3.) is
@@ -1263,7 +1578,7 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       && sqrtfn
       && cbrtfn
       && (gimple_val_nonnegative_real_p (arg0) || !HONOR_NANS (mode))
-      && optimize_function_for_speed_p (cfun)
+      && speed_p
       && hw_sqrt_exists
       && REAL_VALUES_EQUAL (c, dconst1_6))
     {
@@ -1274,54 +1589,31 @@ gimple_expand_builtin_pow (gimple_stmt_iterator *gsi, location_t loc,
       return build_and_insert_call (gsi, loc, cbrtfn, sqrt_arg0);
     }
 
-  /* Optimize pow(x,c), where n = 2c for some nonzero integer n
-     and c not an integer, into
-
-       sqrt(x) * powi(x, n/2),                n > 0;
-       1.0 / (sqrt(x) * powi(x, abs(n/2))),   n < 0.
-
-     Do not calculate the powi factor when n/2 = 0.  */
-  real_arithmetic (&c2, MULT_EXPR, &c, &dconst2);
-  n = real_to_integer (&c2);
-  real_from_integer (&cint, VOIDmode, n, SIGNED);
-  c2_is_int = real_identical (&c2, &cint);
 
+  /* Attempt to expand the POW as a product of square root chains.
+     Expand the 0.25 case even when otpimising for size.  */
   if (flag_unsafe_math_optimizations
       && sqrtfn
-      && c2_is_int
-      && !c_is_int
-      && optimize_function_for_speed_p (cfun))
+      && hw_sqrt_exists
+      && (speed_p || REAL_VALUES_EQUAL (c, dconst1_4))
+      && !HONOR_SIGNED_ZEROS (mode))
     {
-      tree powi_x_ndiv2 = NULL_TREE;
-
-      /* Attempt to fold powi(arg0, abs(n/2)) into multiplies.  If not
-         possible or profitable, give up.  Skip the degenerate case when
-         n is 1 or -1, where the result is always 1.  */
-      if (absu_hwi (n) != 1)
-	{
-	  powi_x_ndiv2 = gimple_expand_builtin_powi (gsi, loc, arg0,
-						     abs_hwi (n / 2));
-	  if (!powi_x_ndiv2)
-	    return NULL_TREE;
-	}
+      unsigned int max_depth = speed_p
+				? PARAM_VALUE (PARAM_MAX_POW_SQRT_DEPTH)
+				: 2;
 
-      /* Calculate sqrt(x).  When n is not 1 or -1, multiply it by the
-	 result of the optimal multiply sequence just calculated.  */
-      sqrt_arg0 = build_and_insert_call (gsi, loc, sqrtfn, arg0);
+      tree expand_with_sqrts
+	= expand_pow_as_sqrts (gsi, loc, arg0, arg1, max_depth);
 
-      if (absu_hwi (n) == 1)
-	result = sqrt_arg0;
-      else
-	result = build_and_insert_binop (gsi, loc, "powroot", MULT_EXPR,
-					 sqrt_arg0, powi_x_ndiv2);
-
-      /* If n is negative, reciprocate the result.  */
-      if (n < 0)
-	result = build_and_insert_binop (gsi, loc, "powroot", RDIV_EXPR,
-					 build_real (type, dconst1), result);
-      return result;
+      if (expand_with_sqrts)
+	return expand_with_sqrts;
     }
 
+  real_arithmetic (&c2, MULT_EXPR, &c, &dconst2);
+  n = real_to_integer (&c2);
+  real_from_integer (&cint, VOIDmode, n, SIGNED);
+  c2_is_int = real_identical (&c2, &cint);
+
   /* Optimize pow(x,c), where 3c = n for some nonzero integer n, into
 
      powi(x, n/3) * powi(cbrt(x), n%3),                    n > 0;