From patchwork Mon May 22 18:32:44 2017
X-Patchwork-Submitter: Michael Meissner
X-Patchwork-Id: 765539
Date: Mon, 22 May 2017 14:32:44 -0400
From: Michael Meissner
To: GCC Patches, Segher Boessenkool, David Edelsohn
Subject: [PATCH], PR target/80718, Improve PowerPC splat double word
Message-Id: <20170522183244.GA22334@ibm-tiger.the-meissners.org>
When I was comparing Spec 2006 numbers between GCC 6.3 and 7.1, one benchmark (milc) was noticeably slower.  In looking at the generated code, the #1 hot function (mult_adj_su3_mat_vec) had some cases where automatic vectorization generated a splat of a double from memory.  The register allocator did not use the load-with-splat instruction (LXVDSX) because all of the loads were register+offset.  For the scalar values that it could load into the FPR registers, it used the normal register+offset (d-form) load.  For the other scalar values that would wind up in the traditional Altivec registers, the register allocator decided to load the value into a GPR and do a direct move.

Now, it turns out that while the above code is inefficient, it is not the cause of the slowdown in the milc benchmark.  However, there might be other places where a load, direct move, and double word permute sequence causes a performance problem, so I made this patch.

The patch splits the splat into a register splat and a memory splat.  This forces the register allocator to convert the load to the indexed form that the LXVDSX instruction uses.

I did a Spec 2006 run with these changes, and there were no significant performance differences with this patch.  In the mult_adj_su3_mat_vec function, there were previously 5 GPR load, direct move, and permute sequences along with one LXVDSX.  With this patch, those GPR loads have been replaced with LXVDSXs.

Can I apply this patch to the trunk, and later apply it to the GCC 7 and 6 branches?

[gcc]
2017-05-22  Michael Meissner

	PR target/80718
	* config/rs6000/vsx.md (vsx_splat_<mode>, VSX_D iterator): Split
	V2DF/V2DI splat into two separate patterns, one that handles
	registers, and the other that only handles memory.  Drop support
	for splatting from a GPR on ISA 2.07 and then splitting the splat
	into direct move and splat.
	(vsx_splat_<mode>_reg): Likewise.
	(vsx_splat_<mode>_mem): Likewise.
[gcc/testsuite]
2017-05-22  Michael Meissner

	PR target/80718
	* gcc.target/powerpc/pr80718.c: New test.

Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 248266)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -3066,29 +3066,36 @@ (define_expand "vsx_mergeh_<mode>"
 })
 
 ;; V2DF/V2DI splat
-(define_insn_and_split "vsx_splat_<mode>"
-  [(set (match_operand:VSX_D 0 "vsx_register_operand"
-	 "=<VSa>,<VSa>,we,<VSa>")
+(define_expand "vsx_splat_<mode>"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand")
 	(vec_duplicate:VSX_D
-	 (match_operand:<VS_scalar> 1 "splat_input_operand"
-	  "<VS_64reg>,Z,b,wA")))]
+	 (match_operand:<VS_scalar> 1 "input_operand")))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+{
+  rtx op1 = operands[1];
+  if (MEM_P (op1))
+    operands[1] = rs6000_address_for_fpconvert (op1);
+  else if (!REG_P (op1))
+    op1 = force_reg (<VS_scalar>mode, op1);
+})
+
+(define_insn "vsx_splat_<mode>_reg"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSa>,?we")
+	(vec_duplicate:VSX_D
+	 (match_operand:<VS_scalar> 1 "gpc_reg_operand" "<VS_64reg>,b")))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
   "@
    xxpermdi %x0,%x1,%x1,0
-   lxvdsx %x0,%y1
-   mtvsrdd %x0,%1,%1
-   #"
-  "&& reload_completed && TARGET_POWERPC64 && !TARGET_P9_VECTOR
-   && int_reg_operand (operands[1], <MODE>mode)"
-  [(set (match_dup 2)
-	(match_dup 1))
-   (set (match_dup 0)
-	(vec_duplicate:VSX_D (match_dup 2)))]
-{
-  operands[2] = gen_rtx_REG (<VS_scalar>mode, reg_or_subregno (operands[0]));
-}
-  [(set_attr "type" "vecperm,vecload,vecperm,vecperm")
-   (set_attr "length" "4,4,4,8")])
+   mtvsrdd %x0,%1,%1"
+  [(set_attr "type" "vecperm")])
+
+(define_insn "vsx_splat_<mode>_mem"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSa>")
+	(vec_duplicate:VSX_D
+	 (match_operand:<VS_scalar> 1 "memory_operand" "Z")))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+  "lxvdsx %x0,%y1"
+  [(set_attr "type" "vecload")])
 
 ;; V4SI splat support
 (define_insn "vsx_splat_v4si"

Index: gcc/testsuite/gcc.target/powerpc/pr80718.c
===================================================================
---
gcc/testsuite/gcc.target/powerpc/pr80718.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80718.c	(revision 0)
@@ -0,0 +1,298 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O3 -ffast-math" } */
+
+/* Taken from the Spec 2006 milc benchmark.  Ultimately, GCC wants to generate
+   a DF splat from offsettable memory.  The register allocator decided it was
+   better to do the load in the GPR registers and do a direct move, rather than
+   doing a load in the VSX register sets.  */
+
+typedef struct
+{
+  double real;
+  double imag;
+} complex;
+
+typedef struct
+{
+  double real;
+  double imag;
+} double_complex;
+
+complex cmplx (double x, double y);
+complex cadd (complex * a, complex * b);
+complex cmul (complex * a, complex * b);
+complex csub (complex * a, complex * b);
+complex cdiv (complex * a, complex * b);
+complex conjg (complex * a);
+complex ce_itheta (double theta);
+
+double_complex dcmplx (double x, double y);
+double_complex dcadd (double_complex * a, double_complex * b);
+double_complex dcmul (double_complex * a, double_complex * b);
+double_complex dcsub (double_complex * a, double_complex * b);
+double_complex dcdiv (double_complex * a, double_complex * b);
+double_complex dconjg (double_complex * a);
+double_complex dcexp (double_complex * a);
+double_complex dclog (double_complex * a);
+double_complex dcsqrt (double_complex * z);
+double_complex dce_itheta (double theta);
+
+typedef struct
+{
+  unsigned long r0, r1, r2, r3, r4, r5, r6;
+  unsigned long multiplier, addend, ic_state;
+  double scale;
+} double_prn;
+
+double myrand (double_prn * prn_pt);
+
+typedef struct
+{
+  complex e[3][3];
+} su3_matrix;
+
+typedef struct
+{
+  complex c[3];
+} su3_vector;
+
+typedef struct
+{
+  complex m01, m02, m12;
+  double m00im, m11im, m22im;
+  double space;
+} anti_hermitmat;
+
+typedef struct
+{
+  complex e[2][2];
+} su2_matrix;
+typedef struct
+{
+  su3_vector d[4];
+} wilson_vector;
+typedef struct
+{
+  su3_vector h[2];
+} half_wilson_vector;
+typedef struct
+{
+  wilson_vector c[3];
+} color_wilson_vector;
+typedef struct
+{
+  wilson_vector d[4];
+} spin_wilson_vector;
+typedef struct
+{
+  color_wilson_vector d[4];
+} wilson_matrix;
+typedef struct
+{
+  spin_wilson_vector c[3];
+} wilson_propagator;
+
+void mult_su3_nn (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void mult_su3_na (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void mult_su3_an (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+double realtrace_su3 (su3_matrix * a, su3_matrix * b);
+complex trace_su3 (su3_matrix * a);
+complex complextrace_su3 (su3_matrix * a, su3_matrix * b);
+complex det_su3 (su3_matrix * a);
+void add_su3_matrix (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void sub_su3_matrix (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void scalar_mult_su3_matrix (su3_matrix * src, double scalar,
+			     su3_matrix * dest);
+void scalar_mult_add_su3_matrix (su3_matrix * src1, su3_matrix * src2,
+				 double scalar, su3_matrix * dest);
+void scalar_mult_sub_su3_matrix (su3_matrix * src1, su3_matrix * src2,
+				 double scalar, su3_matrix * dest);
+void c_scalar_mult_su3mat (su3_matrix * src, complex * scalar,
+			   su3_matrix * dest);
+void c_scalar_mult_add_su3mat (su3_matrix * src1, su3_matrix * src2,
+			       complex * scalar, su3_matrix * dest);
+void c_scalar_mult_sub_su3mat (su3_matrix * src1, su3_matrix * src2,
+			       complex * scalar, su3_matrix * dest);
+void su3_adjoint (su3_matrix * a, su3_matrix * b);
+void make_anti_hermitian (su3_matrix * m3, anti_hermitmat * ah3);
+void random_anti_hermitian (anti_hermitmat * mat_antihermit,
+			    double_prn * prn_pt);
+void uncompress_anti_hermitian (anti_hermitmat * mat_anti, su3_matrix * mat);
+void compress_anti_hermitian (su3_matrix * mat, anti_hermitmat * mat_anti);
+void clear_su3mat (su3_matrix * dest);
+void su3mat_copy (su3_matrix * a, su3_matrix * b);
+void dumpmat (su3_matrix * m);
+
+void su3_projector (su3_vector * a, su3_vector * b, su3_matrix * c);
+complex su3_dot (su3_vector * a, su3_vector * b);
+double su3_rdot (su3_vector * a, su3_vector * b);
+double magsq_su3vec (su3_vector * a);
+void su3vec_copy (su3_vector * a, su3_vector * b);
+void dumpvec (su3_vector * v);
+void clearvec (su3_vector * v);
+
+void mult_su3_mat_vec (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_su3_mat_vec_sum (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_su3_mat_vec_sum_4dir (su3_matrix * a, su3_vector * b0,
+				su3_vector * b1, su3_vector * b2,
+				su3_vector * b3, su3_vector * c);
+void mult_su3_mat_vec_nsum (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_adj_su3_mat_vec (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_adj_su3_mat_vec_4dir (su3_matrix * a, su3_vector * b,
+				su3_vector * c);
+void mult_adj_su3_mat_4vec (su3_matrix * mat, su3_vector * src,
+			    su3_vector * dest0, su3_vector * dest1,
+			    su3_vector * dest2, su3_vector * dest3);
+void mult_adj_su3_mat_vec_sum (su3_matrix * a, su3_vector * b,
+			       su3_vector * c);
+void mult_adj_su3_mat_vec_nsum (su3_matrix * a, su3_vector * b,
+				su3_vector * c);
+
+void add_su3_vector (su3_vector * a, su3_vector * b, su3_vector * c);
+void sub_su3_vector (su3_vector * a, su3_vector * b, su3_vector * c);
+void sub_four_su3_vecs (su3_vector * a, su3_vector * b1, su3_vector * b2,
+			su3_vector * b3, su3_vector * b4);
+
+void scalar_mult_su3_vector (su3_vector * src, double scalar,
+			     su3_vector * dest);
+void scalar_mult_add_su3_vector (su3_vector * src1, su3_vector * src2,
+				 double scalar, su3_vector * dest);
+void scalar_mult_sum_su3_vector (su3_vector * src1, su3_vector * src2,
+				 double scalar);
+void scalar_mult_sub_su3_vector (su3_vector * src1, su3_vector * src2,
+				 double scalar, su3_vector * dest);
+void scalar_mult_wvec (wilson_vector * src, double s, wilson_vector * dest);
+void scalar_mult_hwvec (half_wilson_vector * src, double s,
+			half_wilson_vector * dest);
+void scalar_mult_add_wvec (wilson_vector * src1, wilson_vector * src2,
+			   double scalar, wilson_vector * dest);
+void scalar_mult_addtm_wvec (wilson_vector * src1, wilson_vector * src2,
+			     double scalar, wilson_vector * dest);
+void c_scalar_mult_wvec (wilson_vector * src1, complex * phase,
+			 wilson_vector * dest);
+void c_scalar_mult_add_wvec (wilson_vector * src1, wilson_vector * src2,
+			     complex * phase, wilson_vector * dest);
+void c_scalar_mult_add_wvec2 (wilson_vector * src1, wilson_vector * src2,
+			      complex s, wilson_vector * dest);
+void c_scalar_mult_su3vec (su3_vector * src, complex * phase,
+			   su3_vector * dest);
+void c_scalar_mult_add_su3vec (su3_vector * v1, complex * phase,
+			       su3_vector * v2);
+void c_scalar_mult_sub_su3vec (su3_vector * v1, complex * phase,
+			       su3_vector * v2);
+
+void left_su2_hit_n (su2_matrix * u, int p, int q, su3_matrix * link);
+void right_su2_hit_a (su2_matrix * u, int p, int q, su3_matrix * link);
+void dumpsu2 (su2_matrix * u);
+void mult_su2_mat_vec_elem_n (su2_matrix * u, complex * x0, complex * x1);
+void mult_su2_mat_vec_elem_a (su2_matrix * u, complex * x0, complex * x1);
+
+void mult_mat_wilson_vec (su3_matrix * mat, wilson_vector * src,
+			  wilson_vector * dest);
+void mult_su3_mat_hwvec (su3_matrix * mat, half_wilson_vector * src,
+			 half_wilson_vector * dest);
+void mult_adj_mat_wilson_vec (su3_matrix * mat, wilson_vector * src,
+			      wilson_vector * dest);
+void mult_adj_su3_mat_hwvec (su3_matrix * mat, half_wilson_vector * src,
+			     half_wilson_vector * dest);
+
+void add_wilson_vector (wilson_vector * src1, wilson_vector * src2,
+			wilson_vector * dest);
+void sub_wilson_vector (wilson_vector * src1, wilson_vector * src2,
+			wilson_vector * dest);
+double magsq_wvec (wilson_vector * src);
+complex wvec_dot (wilson_vector * src1, wilson_vector * src2);
+complex wvec2_dot (wilson_vector * src1, wilson_vector * src2);
+double wvec_rdot (wilson_vector * a, wilson_vector * b);
+
+void wp_shrink (wilson_vector * src, half_wilson_vector * dest,
+		int dir, int sign);
+void wp_shrink_4dir (wilson_vector * a, half_wilson_vector * b1,
+		     half_wilson_vector * b2, half_wilson_vector * b3,
+		     half_wilson_vector * b4, int sign);
+void wp_grow (half_wilson_vector * src, wilson_vector * dest,
+	      int dir, int sign);
+void wp_grow_add (half_wilson_vector * src, wilson_vector * dest,
+		  int dir, int sign);
+void grow_add_four_wvecs (wilson_vector * a, half_wilson_vector * b1,
+			  half_wilson_vector * b2, half_wilson_vector * b3,
+			  half_wilson_vector * b4, int sign, int sum);
+void mult_by_gamma (wilson_vector * src, wilson_vector * dest, int dir);
+void mult_by_gamma_left (wilson_matrix * src, wilson_matrix * dest, int dir);
+void mult_by_gamma_right (wilson_matrix * src, wilson_matrix * dest, int dir);
+void mult_swv_by_gamma_l (spin_wilson_vector * src, spin_wilson_vector * dest,
+			  int dir);
+void mult_swv_by_gamma_r (spin_wilson_vector * src, spin_wilson_vector * dest,
+			  int dir);
+void su3_projector_w (wilson_vector * a, wilson_vector * b, su3_matrix * c);
+void clear_wvec (wilson_vector * dest);
+void copy_wvec (wilson_vector * src, wilson_vector * dest);
+void dump_wilson_vec (wilson_vector * src);
+
+double gaussian_rand_no (double_prn * prn_pt);
+typedef int int32type;
+typedef unsigned int u_int32type;
+void byterevn (int32type w[], int n);
+
+void
+mult_adj_su3_mat_vec (su3_matrix * a, su3_vector * b, su3_vector * c)
+{
+  int i;
+  register double t, ar, ai, br, bi, cr, ci;
+  for (i = 0; i < 3; i++)
+    {
+      ar = a->e[0][i].real;
+      ai = a->e[0][i].imag;
+
+      br = b->c[0].real;
+      bi = b->c[0].imag;
+
+      cr = ar * br;
+      t = ai * bi;
+      cr += t;
+
+      ci = ar * bi;
+      t = ai * br;
+      ci -= t;
+
+      ar = a->e[1][i].real;
+      ai = a->e[1][i].imag;
+
+      br = b->c[1].real;
+      bi = b->c[1].imag;
+
+      t = ar * br;
+      cr += t;
+      t = ai * bi;
+      cr += t;
+
+      t = ar * bi;
+      ci += t;
+      t = ai * br;
+      ci -= t;
+
+      ar = a->e[2][i].real;
+      ai = a->e[2][i].imag;
+
+      br = b->c[2].real;
+      bi = b->c[2].imag;
+
+      t = ar * br;
+      cr += t;
+      t = ai * bi;
+      cr += t;
+
+      t = ar * bi;
+      ci += t;
+      t = ai * br;
+      ci -= t;
+
+      c->c[i].real = cr;
+      c->c[i].imag = ci;
+    }
+}
+
+/* { dg-final { scan-assembler-not "mtvsrd" } } */