From patchwork Mon Oct  4 20:19:19 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michael Meissner <meissner@linux.vnet.ibm.com>
X-Patchwork-Id: 66830
Return-Path: 
 <gcc-patches-return-274510-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	by ozlabs.org (Postfix) with SMTP id DD815B70AF
	for <incoming@patchwork.ozlabs.org>;
	Wed,  6 Oct 2010 00:51:02 +1100 (EST)
Received: (qmail 13266 invoked by alias); 5 Oct 2010 13:51:01 -0000
Received: (qmail 13256 invoked by uid 22791); 5 Oct 2010 13:50:58 -0000
X-SWARE-Spam-Status: No, hits=-0.8 required=5.0	tests=AWL, BAYES_00,
	DATE_IN_PAST_12_24, NO_DNS_FOR_FROM, TW_FN, T_RP_MATCHES_RCVD
X-Spam-Check-By: sourceware.org
Received: from e32.co.us.ibm.com (HELO e32.co.us.ibm.com) (32.97.110.150) by
	sourceware.org (qpsmtpd/0.43rc1) with ESMTP;
	Tue, 05 Oct 2010 13:50:50 +0000
Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com
	[9.17.195.106])	by e32.co.us.ibm.com (8.14.4/8.13.1) with
	ESMTP id o95Dg09d022470	for <gcc-patches@gcc.gnu.org>;
	Tue, 5 Oct 2010 07:42:00 -0600
Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com
	[9.17.195.170])	by d03relay04.boulder.ibm.com
	(8.13.8/8.13.8/NCO v10.0) with ESMTP id o95DoaDr110312	for
	<gcc-patches@gcc.gnu.org>; Tue, 5 Oct 2010 07:50:37 -0600
Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1])	by
	d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with
	ESMTP id o95DoagB030668	for <gcc-patches@gcc.gnu.org>;
	Tue, 5 Oct 2010 07:50:36 -0600
Received: from hungry-tiger.westford.ibm.com (dyn9033037152.westford.ibm.com
	[9.33.37.152])	by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO
	v10.0 AVin) with ESMTP id o95DoZfe030550;
	Tue, 5 Oct 2010 07:50:36 -0600
Received: by hungry-tiger.westford.ibm.com (Postfix, from userid 500)	id
	57BC6F86CC; Mon,  4 Oct 2010 16:19:19 -0400 (EDT)
Date: Mon, 4 Oct 2010 16:19:19 -0400
From: Michael Meissner <meissner@linux.vnet.ibm.com>
To: gcc-patches@gcc.gnu.org, dje.gcc@gmail.com
Subject: [PATCH], RFC Add support for C99 fma/fmas/fmal builtins
Message-ID: <20101004201919.GA22525@hungry-tiger.westford.ibm.com>
Mail-Followup-To: Michael Meissner <meissner@linux.vnet.ibm.com>,
	gcc-patches@gcc.gnu.org, dje.gcc@gmail.com
MIME-Version: 1.0
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
X-IsSubscribed: yes
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org

I happen to reread parts of the C99 standard, and noticed the FMA builtins.
The current compiler defines fma{,f,l} builtins but does not provide any code
to expand the builtin.

The enclosed patch adds the basic FMA support to the compiler, and adds the
support to the powerpc backend.  In doing a quick grep of the MD files, it
looks like the following ports may support a combined multiply/add operation:
      arm
      frv
      i386
      ia64
      mips
      mmix
      pa
      390
      sh
      spu
      v850
      xtensa

On the powerpc, I made the fma{,s} builtin always generate the the FMADD and
FMADDS instructions even if the user used -mno-fused-madd to suppress the
automatic merging of add/multiply operations, since that is in the intent of
the builtin to provide access to the merged instruction (on most hardware with
a FMA instruction, there is no merge step between the multiply and add, so the
result can be subtly different).

I will leave it to the port maintainers to implment FMA on their machines if
they wish or not if these patches go in.

My question is what should we do if the port has no FMA instruction?

   1)	Always call fma external (current behavior)
   2)	Expand to (operand[0] * operand[1]) + operand[2] always
   3)	Expand to (operand[0] * operand[1]) + operand[2] if -ffast-math

I don't know what the few users of this builtin would expect as the default
behavior.  I tend to think that #3 might be the best, in that often times with
-ffast-math, implementations that have a FMA instruciton would generate it.

A second issue is should we provide macros that say the port has an appropriate
FMA instruction.  The C99 standard says that the following macros should be
defined in math.h if a fast fma implementation is provided:

	FP_FAST_FMA
	FP_FAST_FMAF
	FP_FAST_FMAL

I would imagine that we should provide __FAST_FMA__, __FAST_FMAF__, and
__FAST_FMAL__ macros that the library math.h file could test and define
FP_FAST_FMA* if desired.  Do people have an opinion on this?

Thanks in advance.

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	(revision 164946)
+++ gcc/doc/md.texi	(working copy)
@@ -3948,6 +3948,16 @@ means of constraints requiring operands 
 @itemx @samp{and@var{m}3}, @samp{ior@var{m}3}, @samp{xor@var{m}3}
 Similar, for other arithmetic operations.
 
+@cindex @code{fma@var{m}4} instruction pattern
+@item @samp{fma@var{m}4}
+Multiply operand 2 and operand 1, then add operand 3, storing the
+result in operand 0.  All operands must have mode @var{m}.  This
+pattern is used to implement the @code{fma}, @code{fmas}, and
+@code{fmal} builtin functions from the ISO C99 standard.  The
+@code{fma} operation may produce different results than doing the
+multiply followed by the add if the machine does not perform a
+rounding step between the operations.
+
 @cindex @code{min@var{m}3} instruction pattern
 @cindex @code{max@var{m}3} instruction pattern
 @item @samp{smin@var{m}3}, @samp{smax@var{m}3}
Index: gcc/optabs.h
===================================================================
--- gcc/optabs.h	(revision 164946)
+++ gcc/optabs.h	(working copy)
@@ -190,6 +190,8 @@ enum optab_index
   OTI_pow,
   /* Arc tangent of y/x */
   OTI_atan2,
+  /* Floating multiply/add */
+  OTI_fma,
 
   /* Move instruction.  */
   OTI_mov,
@@ -432,6 +434,7 @@ enum optab_index
 #define umax_optab (&optab_table[OTI_umax])
 #define pow_optab (&optab_table[OTI_pow])
 #define atan2_optab (&optab_table[OTI_atan2])
+#define fma_optab (&optab_table[OTI_fma])
 
 #define mov_optab (&optab_table[OTI_mov])
 #define movstrict_optab (&optab_table[OTI_movstrict])
Index: gcc/genopinit.c
===================================================================
--- gcc/genopinit.c	(revision 164946)
+++ gcc/genopinit.c	(working copy)
@@ -159,6 +159,7 @@ static const char * const optabs[] =
   "set_optab_handler (sqrt_optab, $A, CODE_FOR_$(sqrt$a2$))",
   "set_optab_handler (floor_optab, $A, CODE_FOR_$(floor$a2$))",
   "set_convert_optab_handler (lfloor_optab, $B, $A, CODE_FOR_$(lfloor$F$a$I$b2$))",
+  "set_optab_handler (fma_optab, $A, CODE_FOR_$(fma$a4$))",
   "set_optab_handler (ceil_optab, $A, CODE_FOR_$(ceil$a2$))",
   "set_convert_optab_handler (lceil_optab, $B, $A, CODE_FOR_$(lceil$F$a$I$b2$))",
   "set_optab_handler (round_optab, $A, CODE_FOR_$(round$a2$))",
Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	(revision 164946)
+++ gcc/builtins.c	(working copy)
@@ -106,6 +106,7 @@ static void expand_errno_check (tree, rt
 static rtx expand_builtin_mathfn (tree, rtx, rtx);
 static rtx expand_builtin_mathfn_2 (tree, rtx, rtx);
 static rtx expand_builtin_mathfn_3 (tree, rtx, rtx);
+static rtx expand_builtin_mathfn_ternary (tree, rtx, rtx);
 static rtx expand_builtin_interclass_mathfn (tree, rtx);
 static rtx expand_builtin_sincos (tree);
 static rtx expand_builtin_cexpi (tree, rtx);
@@ -2185,6 +2186,79 @@ expand_builtin_mathfn_2 (tree exp, rtx t
   return target;
 }
 
+/* Expand a call to the builtin trinary math functions (fma).
+   Return NULL_RTX if a normal call should be emitted rather than expanding the
+   function in-line.  EXP is the expression that is a call to the builtin
+   function; if convenient, the result should be placed in TARGET.
+   SUBTARGET may be used as the target for computing one of EXP's
+   operands.  */
+
+static rtx
+expand_builtin_mathfn_ternary (tree exp, rtx target, rtx subtarget)
+{
+  optab builtin_optab;
+  rtx op0, op1, op2, insns;
+  tree fndecl = get_callee_fndecl (exp);
+  tree arg0, arg1, arg2;
+  enum machine_mode mode;
+
+  if (!validate_arglist (exp, REAL_TYPE, REAL_TYPE, REAL_TYPE, VOID_TYPE))
+    return NULL_RTX;
+
+  arg0 = CALL_EXPR_ARG (exp, 0);
+  arg1 = CALL_EXPR_ARG (exp, 1);
+  arg2 = CALL_EXPR_ARG (exp, 2);
+
+  switch (DECL_FUNCTION_CODE (fndecl))
+    {
+    CASE_FLT_FN (BUILT_IN_FMA):
+      builtin_optab = fma_optab; break;
+    default:
+      gcc_unreachable ();
+    }
+
+  /* Make a suitable register to place result in.  */
+  mode = TYPE_MODE (TREE_TYPE (exp));
+
+  /* Before working hard, check whether the instruction is available.  */
+  if (optab_handler (builtin_optab, mode) == CODE_FOR_nothing)
+    return NULL_RTX;
+
+  target = gen_reg_rtx (mode);
+
+  /* Always stabilize the argument list.  */
+  CALL_EXPR_ARG (exp, 0) = arg0 = builtin_save_expr (arg0);
+  CALL_EXPR_ARG (exp, 1) = arg1 = builtin_save_expr (arg1);
+  CALL_EXPR_ARG (exp, 2) = arg2 = builtin_save_expr (arg2);
+
+  op0 = expand_expr (arg0, subtarget, VOIDmode, EXPAND_NORMAL);
+  op1 = expand_normal (arg1);
+  op2 = expand_normal (arg2);
+
+  start_sequence ();
+
+  /* Compute into TARGET.
+     Set TARGET to wherever the result comes back.  */
+  target = expand_ternary_op (mode, builtin_optab, op0, op1, op2,
+			      target, 0);
+
+  /* If we were unable to expand via the builtin, stop the sequence
+     (without outputting the insns) and call to the library function
+     with the stabilized argument list.  */
+  if (target == 0)
+    {
+      end_sequence ();
+      return expand_call (exp, target, target == const0_rtx);
+    }
+
+  /* Output the entire sequence.  */
+  insns = get_insns ();
+  end_sequence ();
+  emit_insn (insns);
+
+  return target;
+}
+
 /* Expand a call to the builtin sin and cos math functions.
    Return NULL_RTX if a normal call should be emitted rather than expanding the
    function in-line.  EXP is the expression that is a call to the builtin
@@ -5829,6 +5903,12 @@ expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
+    CASE_FLT_FN (BUILT_IN_FMA):
+      target = expand_builtin_mathfn_ternary (exp, target, subtarget);
+      if (target)
+	return target;
+      break;
+
     CASE_FLT_FN (BUILT_IN_ILOGB):
       if (! flag_unsafe_math_optimizations)
 	break;
Index: gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c	(revision 0)
@@ -0,0 +1,84 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O3 -ftree-vectorize -mcpu=power7 -ffast-math -mno-fused-madd" } */
+/* { dg-final { scan-assembler-times "xvmadd" 2 } } */
+/* { dg-final { scan-assembler-times "xsmadd" 1 } } */
+/* { dg-final { scan-assembler-times "fmadds" 1 } } */
+
+/* Only the functions calling the bulitin should generate an appropriate (a *
+   b) + c instruction.  */
+
+double
+builtin_fma (double b, double c, double d)
+{
+  return __builtin_fma (b, c, d);
+}
+
+float
+builtin_fmaf (float b, float c, float d)
+{
+  return __builtin_fmaf (b, c, d);
+}
+
+double
+normal_fma (double b, double c, double d)
+{
+  return (b * c) + d;
+}
+
+float
+normal_fmaf (float b, float c, float d)
+{
+  return (b * c) + d;
+}
+
+#ifndef SIZE
+#define SIZE 1024
+#endif
+
+double vda[SIZE] __attribute__((__aligned__(32)));
+double vdb[SIZE] __attribute__((__aligned__(32)));
+double vdc[SIZE] __attribute__((__aligned__(32)));
+double vdd[SIZE] __attribute__((__aligned__(32)));
+
+float vfa[SIZE] __attribute__((__aligned__(32)));
+float vfb[SIZE] __attribute__((__aligned__(32)));
+float vfc[SIZE] __attribute__((__aligned__(32)));
+float vfd[SIZE] __attribute__((__aligned__(32)));
+
+void
+vector_fma (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vda[i] = __builtin_fma (vdb[i], vdc[i], vdd[i]);
+}
+
+void
+vector_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = __builtin_fmaf (vfb[i], vfc[i], vfd[i]);
+}
+
+void
+vnormal_fma (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vda[i] = (vdb[i] * vdc[i]) + vdd[i];
+}
+
+void
+vnormal_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = (vfb[i] * vfc[i]) + vfd[i];
+}
Index: gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c	(revision 0)
@@ -0,0 +1,61 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3 -ftree-vectorize -mcpu=power6 -maltivec -ffast-math" } */
+/* { dg-final { scan-assembler-times "vmaddfp" 2 } } */
+/* { dg-final { scan-assembler-times "fmadd " 2 } } */
+/* { dg-final { scan-assembler-times "fmadds" 2 } } */
+
+/* All functions should generate an appropriate (a * b) + c instruction
+   since -mfused-madd is on by default.  */
+
+double
+builtin_fma (double b, double c, double d)
+{
+  return __builtin_fma (b, c, d);
+}
+
+float
+builtin_fmaf (float b, float c, float d)
+{
+  return __builtin_fmaf (b, c, d);
+}
+
+double
+normal_fma (double b, double c, double d)
+{
+  return (b * c) + d;
+}
+
+float
+normal_fmaf (float b, float c, float d)
+{
+  return (b * c) + d;
+}
+
+#ifndef SIZE
+#define SIZE 1024
+#endif
+
+float vfa[SIZE] __attribute__((__aligned__(32)));
+float vfb[SIZE] __attribute__((__aligned__(32)));
+float vfc[SIZE] __attribute__((__aligned__(32)));
+float vfd[SIZE] __attribute__((__aligned__(32)));
+
+void
+vector_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = __builtin_fmaf (vfb[i], vfc[i], vfd[i]);
+}
+
+void
+vnormal_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = (vfb[i] * vfc[i]) + vfd[i];
+}
Index: gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c	(revision 0)
@@ -0,0 +1,61 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3 -ftree-vectorize -mcpu=power6 -maltivec -ffast-math -mno-fused-madd" } */
+/* { dg-final { scan-assembler-times "vmaddfp" 1 } } */
+/* { dg-final { scan-assembler-times "fmadd " 1 } } */
+/* { dg-final { scan-assembler-times "fmadds" 1 } } */
+
+/* Only the functions calling the builtin should generate an appropriate
+   (a * b) + c instruction.  */
+
+double
+builtin_fma (double b, double c, double d)
+{
+  return __builtin_fma (b, c, d);
+}
+
+float
+builtin_fmaf (float b, float c, float d)
+{
+  return __builtin_fmaf (b, c, d);
+}
+
+double
+normal_fma (double b, double c, double d)
+{
+  return (b * c) + d;
+}
+
+float
+normal_fmaf (float b, float c, float d)
+{
+  return (b * c) + d;
+}
+
+#ifndef SIZE
+#define SIZE 1024
+#endif
+
+float vfa[SIZE] __attribute__((__aligned__(32)));
+float vfb[SIZE] __attribute__((__aligned__(32)));
+float vfc[SIZE] __attribute__((__aligned__(32)));
+float vfd[SIZE] __attribute__((__aligned__(32)));
+
+void
+vector_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = __builtin_fmaf (vfb[i], vfc[i], vfd[i]);
+}
+
+void
+vnormal_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = (vfb[i] * vfc[i]) + vfd[i];
+}
Index: gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c	(revision 0)
@@ -0,0 +1,84 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O3 -ftree-vectorize -mcpu=power7 -ffast-math" } */
+/* { dg-final { scan-assembler-times "xvmadd" 4 } } */
+/* { dg-final { scan-assembler-times "xsmadd" 2 } } */
+/* { dg-final { scan-assembler-times "fmadds" 2 } } */
+
+/* All functions should generate an appropriate (a * b) + c instruction
+   since -mfused-madd is on by default.  */
+
+double
+builtin_fma (double b, double c, double d)
+{
+  return __builtin_fma (b, c, d);
+}
+
+float
+builtin_fmaf (float b, float c, float d)
+{
+  return __builtin_fmaf (b, c, d);
+}
+
+double
+normal_fma (double b, double c, double d)
+{
+  return (b * c) + d;
+}
+
+float
+normal_fmaf (float b, float c, float d)
+{
+  return (b * c) + d;
+}
+
+#ifndef SIZE
+#define SIZE 1024
+#endif
+
+double vda[SIZE] __attribute__((__aligned__(32)));
+double vdb[SIZE] __attribute__((__aligned__(32)));
+double vdc[SIZE] __attribute__((__aligned__(32)));
+double vdd[SIZE] __attribute__((__aligned__(32)));
+
+float vfa[SIZE] __attribute__((__aligned__(32)));
+float vfb[SIZE] __attribute__((__aligned__(32)));
+float vfc[SIZE] __attribute__((__aligned__(32)));
+float vfd[SIZE] __attribute__((__aligned__(32)));
+
+void
+vector_fma (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vda[i] = __builtin_fma (vdb[i], vdc[i], vdd[i]);
+}
+
+void
+vector_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = __builtin_fmaf (vfb[i], vfc[i], vfd[i]);
+}
+
+void
+vnormal_fma (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vda[i] = (vdb[i] * vdc[i]) + vdd[i];
+}
+
+void
+vnormal_fmaf (void)
+{
+  int i;
+
+  for (i = 0; i < SIZE; i++)
+    vfa[i] = (vfb[i] * vfc[i]) + vfd[i];
+}
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 164946)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -1357,10 +1357,10 @@ vectorizable_call (gimple stmt, gimple_s
   vectype_in = NULL_TREE;
   nargs = gimple_call_num_args (stmt);
 
-  /* Bail out if the function has more than two arguments, we
-     do not have interesting builtin functions to vectorize with
-     more than two arguments.  No arguments is also not good.  */
-  if (nargs == 0 || nargs > 2)
+  /* Bail out if the function has more than three arguments, we do not have
+     interesting builtin functions to vectorize with more than two arguments
+     except for fma.  No arguments is also not good.  */
+  if (nargs == 0 || nargs > 3)
     return false;
 
   for (i = 0; i < nargs; i++)
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 164946)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -3895,6 +3895,22 @@ rs6000_builtin_vectorized_function (tree
 	  if (VECTOR_UNIT_ALTIVEC_P (V4SFmode))
 	    return rs6000_builtin_decls[ALTIVEC_BUILTIN_VRFIM];
 	  break;
+	case BUILT_IN_FMA:
+	  if (VECTOR_UNIT_VSX_P (V2DFmode)
+	      && out_mode == DFmode && out_n == 2
+	      && in_mode == DFmode && in_n == 2)
+	    return rs6000_builtin_decls[VSX_BUILTIN_XVMADDDP];
+	  break;
+	case BUILT_IN_FMAF:
+	  if (VECTOR_UNIT_VSX_P (V4SFmode)
+	      && out_mode == SFmode && out_n == 4
+	      && in_mode == SFmode && in_n == 4)
+	    return rs6000_builtin_decls[VSX_BUILTIN_XVMADDSP];
+	  else if (VECTOR_UNIT_ALTIVEC_P (V4SFmode)
+	      && out_mode == SFmode && out_n == 4
+	      && in_mode == SFmode && in_n == 4)
+	    return rs6000_builtin_decls[ALTIVEC_BUILTIN_VMADDFP];
+	  break;
 	case BUILT_IN_TRUNC:
 	  if (VECTOR_UNIT_VSX_P (V2DFmode)
 	      && out_mode == DFmode && out_n == 2
Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(revision 164946)
+++ gcc/config/rs6000/rs6000.md	(working copy)
@@ -110,6 +110,7 @@ (define_constants
    (UNSPEC_LFIWAX		56)
    (UNSPEC_LFIWZX		57)
    (UNSPEC_FCTIWUZ		58)
+   (UNSPEC_FMA			59)
   ])
 
 ;;
@@ -5844,6 +5845,39 @@ (define_insn "fres"
   "fres %0,%1"
   [(set_attr "type" "fp")])
 
+(define_expand "fmasf4"
+  [(set (match_operand:SF 0 "gpc_reg_operand" "")
+	(plus:SF (mult:SF (match_operand:SF 1 "gpc_reg_operand" "")
+			  (match_operand:SF 2 "gpc_reg_operand" ""))
+		 (match_operand:SF 3 "gpc_reg_operand" "")))]
+  "TARGET_HARD_FLOAT && TARGET_FPRS
+   && ((TARGET_POWERPC && TARGET_SINGLE_FLOAT)
+       || ! TARGET_POWERPC)"
+{
+  if (!TARGET_FUSED_MADD)
+    {
+      if (TARGET_POWERPC)
+	emit_insn (gen_fmasf4_powerpc (operands[0], operands[1], operands[2],
+				       operands[3]));
+      else
+	emit_insn (gen_fmasf4_power (operands[0], operands[1], operands[2],
+				     operands[3]));
+      DONE;
+    }
+})
+
+; Fma builtin that will work even if -mno-fused-madd is used.
+(define_insn "fmasf4_powerpc"
+  [(set (match_operand:SF 0 "gpc_reg_operand" "=f")
+	(unspec:SF [(match_operand:SF 1 "gpc_reg_operand" "f")
+		    (match_operand:SF 2 "gpc_reg_operand" "f")
+		    (match_operand:SF 3 "gpc_reg_operand" "f")]
+		   UNSPEC_FMA))]
+  "TARGET_HARD_FLOAT && TARGET_FPRS && TARGET_POWERPC"
+  "fmadds %0,%1,%2,%3"
+  [(set_attr "type" "fp")
+   (set_attr "fp_type" "fp_maddsub_s")])
+  
 (define_insn "*fmaddsf4_powerpc"
   [(set (match_operand:SF 0 "gpc_reg_operand" "=f")
 	(plus:SF (mult:SF (match_operand:SF 1 "gpc_reg_operand" "%f")
@@ -5855,6 +5889,16 @@ (define_insn "*fmaddsf4_powerpc"
   [(set_attr "type" "fp")
    (set_attr "fp_type" "fp_maddsub_s")])
 
+(define_insn "fmasf4_power"
+  [(set (match_operand:SF 0 "gpc_reg_operand" "=f")
+	(unspec:SF [(match_operand:SF 1 "gpc_reg_operand" "f")
+		    (match_operand:SF 2 "gpc_reg_operand" "f")
+		    (match_operand:SF 3 "gpc_reg_operand" "f")]
+		   UNSPEC_FMA))]
+  "TARGET_HARD_FLOAT && TARGET_FPRS && !TARGET_POWERPC"
+  "{fma|fmadd} %0,%1,%2,%3"
+  [(set_attr "type" "dmul")])
+  
 (define_insn "*fmaddsf4_power"
   [(set (match_operand:SF 0 "gpc_reg_operand" "=f")
 	(plus:SF (mult:SF (match_operand:SF 1 "gpc_reg_operand" "%f")
@@ -6280,6 +6324,40 @@ (define_insn "*rsqrtdf_internal1"
   "frsqrte %0,%1"
   [(set_attr "type" "fp")])
 
+(define_expand "fmadf4"
+  [(set (match_operand:DF 0 "gpc_reg_operand" "")
+	(plus:DF (mult:DF (match_operand:DF 1 "gpc_reg_operand" "")
+			  (match_operand:DF 2 "gpc_reg_operand" ""))
+		 (match_operand:DF 3 "gpc_reg_operand" "")))]
+  "TARGET_HARD_FLOAT && TARGET_FPRS
+   && ((TARGET_POWERPC && TARGET_SINGLE_FLOAT)
+       || ! TARGET_POWERPC)"
+{
+  if (!TARGET_FUSED_MADD)
+    {
+      if (VECTOR_UNIT_VSX_P (DFmode))
+	emit_insn (gen_vsx_fmadddf4_2 (operands[0], operands[1], operands[2],
+				       operands[3]));
+      else
+	emit_insn (gen_fmadf4_fpr (operands[0], operands[1], operands[2],
+				   operands[3]));
+      DONE;
+    }
+})
+
+; Fma builtin that will work even if -mno-fused-madd is used.
+(define_insn "fmadf4_fpr"
+  [(set (match_operand:DF 0 "gpc_reg_operand" "=d")
+	(unspec:DF [(match_operand:DF 1 "gpc_reg_operand" "d")
+		    (match_operand:DF 2 "gpc_reg_operand" "d")
+		    (match_operand:DF 3 "gpc_reg_operand" "d")]
+		   UNSPEC_FMA))]
+  "TARGET_HARD_FLOAT && TARGET_FPRS && TARGET_DOUBLE_FLOAT
+   && VECTOR_UNIT_NONE_P (DFmode)"
+  "{fma|fmadd} %0,%1,%2,%3"
+  [(set_attr "type" "dmul")
+   (set_attr "fp_type" "fp_maddsub_d")])
+
 (define_insn "*fmadddf4_fpr"
   [(set (match_operand:DF 0 "gpc_reg_operand" "=d")
 	(plus:DF (mult:DF (match_operand:DF 1 "gpc_reg_operand" "%d")