From patchwork Mon May 22 18:32:44 2017
X-Patchwork-Submitter: Michael Meissner
X-Patchwork-Id: 765539
Date: Mon, 22 May 2017 14:32:44 -0400
From: Michael Meissner
To: GCC Patches, Segher Boessenkool, David Edelsohn
Subject: [PATCH], PR target/80718, Improve PowerPC splat double word
Message-Id: <20170522183244.GA22334@ibm-tiger.the-meissners.org>
When I was comparing Spec 2006 numbers between GCC 6.3 and 7.1, one benchmark (milc) was noticeably slower.  In looking at the generated code, the #1 hot function (mult_adj_su3_mat_vec) had some cases where automatic vectorization generated a splat of a double from memory.  The register allocator did not use the load-with-splat instruction (LXVDSX) because all of the loads were register+offset.  For the scalar values that it could load into the FPR registers, it used the normal register+offset (d-form) load.  For the other scalar values that would wind up in the traditional Altivec registers, the register allocator decided to load the value into a GPR and do a direct move.

Now, it turns out that while the above code is inefficient, it is not the cause of the slowdown in the milc benchmark.  However, there might be other places where a load, direct move, and double word permute sequence causes a performance problem, so I made this patch.

The patch splits the splat into a register splat and a memory splat.  This forces the register allocator to convert the load to the indexed form that the LXVDSX instruction uses.

I did a Spec 2006 run with these changes, and there were no significant performance differences with this patch.  In the mult_adj_su3_mat_vec function, there were previously 5 GPR load, direct move, and permute sequences along with one LXVDSX.  With this patch, those GPR loads have been replaced with LXVDSXs.

Can I apply this patch to the trunk, and later apply it to the GCC 7 and 6 branches?

[gcc]
2017-05-22  Michael Meissner

	PR target/80718
	* config/rs6000/vsx.md (vsx_splat_<mode>, VSX_D iterator): Split
	V2DF/V2DI splat into two separate patterns, one that handles
	registers, and the other that only handles memory.  Drop support
	for splatting from a GPR on ISA 2.07 and then splitting the splat
	into direct move and splat.
	(vsx_splat_<mode>_reg): Likewise.
	(vsx_splat_<mode>_mem): Likewise.
[gcc/testsuite]
2017-05-22  Michael Meissner

	PR target/80718
	* gcc.target/powerpc/pr80718.c: New test.

Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 248266)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -3066,29 +3066,36 @@ (define_expand "vsx_mergeh_<mode>"
 })
 
 ;; V2DF/V2DI splat
-(define_insn_and_split "vsx_splat_<mode>"
-  [(set (match_operand:VSX_D 0 "vsx_register_operand"
-	 "=<VSa>,<VSa>,we,<VSa>")
+(define_expand "vsx_splat_<mode>"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand")
 	(vec_duplicate:VSX_D
-	 (match_operand:<VS_scalar> 1 "splat_input_operand"
-	  "<VS_64reg>,Z,b,wA")))]
+	 (match_operand:<VS_scalar> 1 "input_operand")))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+{
+  rtx op1 = operands[1];
+  if (MEM_P (op1))
+    operands[1] = rs6000_address_for_fpconvert (op1);
+  else if (!REG_P (op1))
+    op1 = force_reg (<VS_scalar>mode, op1);
+})
+
+(define_insn "vsx_splat_<mode>_reg"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSa>,?we")
+	(vec_duplicate:VSX_D
+	 (match_operand:<VS_scalar> 1 "gpc_reg_operand" "<VS_64reg>,b")))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
   "@
    xxpermdi %x0,%x1,%x1,0
-   lxvdsx %x0,%y1
-   mtvsrdd %x0,%1,%1
-   #"
-  "&& reload_completed && TARGET_POWERPC64 && !TARGET_P9_VECTOR
-   && int_reg_operand (operands[1], <MODE>mode)"
-  [(set (match_dup 2)
-	(match_dup 1))
-   (set (match_dup 0)
-	(vec_duplicate:VSX_D (match_dup 2)))]
-{
-  operands[2] = gen_rtx_REG (<VS_scalar>mode, reg_or_subregno (operands[0]));
-}
-  [(set_attr "type" "vecperm,vecload,vecperm,vecperm")
-   (set_attr "length" "4,4,4,8")])
+   mtvsrdd %x0,%1,%1"
+  [(set_attr "type" "vecperm")])
+
+(define_insn "vsx_splat_<mode>_mem"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSa>")
+	(vec_duplicate:VSX_D
+	 (match_operand:<VS_scalar> 1 "memory_operand" "Z")))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+  "lxvdsx %x0,%y1"
+  [(set_attr "type" "vecload")])
 
 ;; V4SI splat support
 (define_insn "vsx_splat_v4si"

Index: gcc/testsuite/gcc.target/powerpc/pr80718.c
===================================================================
---
gcc/testsuite/gcc.target/powerpc/pr80718.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80718.c	(revision 0)
@@ -0,0 +1,298 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O3 -ffast-math" } */
+
+/* Taken from the Spec 2006 milc benchmark.  Ultimately, GCC wants to generate
+   a DF splat from offsettable memory.  The register allocator decided it was
+   better to do the load in the GPR registers and do a direct move, rather than
+   doing a load in the VSX register sets.  */
+
+typedef struct
+{
+  double real;
+  double imag;
+} complex;
+
+typedef struct
+{
+  double real;
+  double imag;
+} double_complex;
+
+complex cmplx (double x, double y);
+complex cadd (complex * a, complex * b);
+complex cmul (complex * a, complex * b);
+complex csub (complex * a, complex * b);
+complex cdiv (complex * a, complex * b);
+complex conjg (complex * a);
+complex ce_itheta (double theta);
+
+double_complex dcmplx (double x, double y);
+double_complex dcadd (double_complex * a, double_complex * b);
+double_complex dcmul (double_complex * a, double_complex * b);
+double_complex dcsub (double_complex * a, double_complex * b);
+double_complex dcdiv (double_complex * a, double_complex * b);
+double_complex dconjg (double_complex * a);
+double_complex dcexp (double_complex * a);
+double_complex dclog (double_complex * a);
+double_complex dcsqrt (double_complex * z);
+double_complex dce_itheta (double theta);
+
+typedef struct
+{
+  unsigned long r0, r1, r2, r3, r4, r5, r6;
+  unsigned long multiplier, addend, ic_state;
+  double scale;
+} double_prn;
+
+double myrand (double_prn * prn_pt);
+
+typedef struct
+{
+  complex e[3][3];
+} su3_matrix;
+
+typedef struct
+{
+  complex c[3];
+} su3_vector;
+
+typedef struct
+{
+  complex m01, m02, m12;
+  double m00im, m11im, m22im;
+  double space;
+} anti_hermitmat;
+
+typedef struct
+{
+  complex e[2][2];
+} su2_matrix;
+typedef struct
+{
+  su3_vector d[4];
+} wilson_vector;
+typedef struct
+{
+  su3_vector h[2];
+} half_wilson_vector;
+typedef struct
+{
+  wilson_vector c[3];
+} color_wilson_vector;
+typedef struct
+{
+  wilson_vector d[4];
+} spin_wilson_vector;
+typedef struct
+{
+  color_wilson_vector d[4];
+} wilson_matrix;
+typedef struct
+{
+  spin_wilson_vector c[3];
+} wilson_propagator;
+
+void mult_su3_nn (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void mult_su3_na (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void mult_su3_an (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+double realtrace_su3 (su3_matrix * a, su3_matrix * b);
+complex trace_su3 (su3_matrix * a);
+complex complextrace_su3 (su3_matrix * a, su3_matrix * b);
+complex det_su3 (su3_matrix * a);
+void add_su3_matrix (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void sub_su3_matrix (su3_matrix * a, su3_matrix * b, su3_matrix * c);
+void scalar_mult_su3_matrix (su3_matrix * src, double scalar,
+			     su3_matrix * dest);
+void scalar_mult_add_su3_matrix (su3_matrix * src1, su3_matrix * src2,
+				 double scalar, su3_matrix * dest);
+void scalar_mult_sub_su3_matrix (su3_matrix * src1, su3_matrix * src2,
+				 double scalar, su3_matrix * dest);
+void c_scalar_mult_su3mat (su3_matrix * src, complex * scalar,
+			   su3_matrix * dest);
+void c_scalar_mult_add_su3mat (su3_matrix * src1, su3_matrix * src2,
+			       complex * scalar, su3_matrix * dest);
+void c_scalar_mult_sub_su3mat (su3_matrix * src1, su3_matrix * src2,
+			       complex * scalar, su3_matrix * dest);
+void su3_adjoint (su3_matrix * a, su3_matrix * b);
+void make_anti_hermitian (su3_matrix * m3, anti_hermitmat * ah3);
+void random_anti_hermitian (anti_hermitmat * mat_antihermit,
+			    double_prn * prn_pt);
+void uncompress_anti_hermitian (anti_hermitmat * mat_anti, su3_matrix * mat);
+void compress_anti_hermitian (su3_matrix * mat, anti_hermitmat * mat_anti);
+void clear_su3mat (su3_matrix * dest);
+void su3mat_copy (su3_matrix * a, su3_matrix * b);
+void dumpmat (su3_matrix * m);
+
+void su3_projector (su3_vector * a, su3_vector * b, su3_matrix * c);
+complex su3_dot (su3_vector * a, su3_vector * b);
+double su3_rdot (su3_vector * a, su3_vector * b);
+double magsq_su3vec (su3_vector * a);
+void su3vec_copy (su3_vector * a, su3_vector * b);
+void dumpvec (su3_vector * v);
+void clearvec (su3_vector * v);
+
+void mult_su3_mat_vec (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_su3_mat_vec_sum (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_su3_mat_vec_sum_4dir (su3_matrix * a, su3_vector * b0,
+				su3_vector * b1, su3_vector * b2,
+				su3_vector * b3, su3_vector * c);
+void mult_su3_mat_vec_nsum (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_adj_su3_mat_vec (su3_matrix * a, su3_vector * b, su3_vector * c);
+void mult_adj_su3_mat_vec_4dir (su3_matrix * a, su3_vector * b,
+				su3_vector * c);
+void mult_adj_su3_mat_4vec (su3_matrix * mat, su3_vector * src,
+			    su3_vector * dest0, su3_vector * dest1,
+			    su3_vector * dest2, su3_vector * dest3);
+void mult_adj_su3_mat_vec_sum (su3_matrix * a, su3_vector * b,
+			       su3_vector * c);
+void mult_adj_su3_mat_vec_nsum (su3_matrix * a, su3_vector * b,
+				su3_vector * c);
+
+void add_su3_vector (su3_vector * a, su3_vector * b, su3_vector * c);
+void sub_su3_vector (su3_vector * a, su3_vector * b, su3_vector * c);
+void sub_four_su3_vecs (su3_vector * a, su3_vector * b1, su3_vector * b2,
+			su3_vector * b3, su3_vector * b4);
+
+void scalar_mult_su3_vector (su3_vector * src, double scalar,
+			     su3_vector * dest);
+void scalar_mult_add_su3_vector (su3_vector * src1, su3_vector * src2,
+				 double scalar, su3_vector * dest);
+void scalar_mult_sum_su3_vector (su3_vector * src1, su3_vector * src2,
+				 double scalar);
+void scalar_mult_sub_su3_vector (su3_vector * src1, su3_vector * src2,
+				 double scalar, su3_vector * dest);
+void scalar_mult_wvec (wilson_vector * src, double s, wilson_vector * dest);
+void scalar_mult_hwvec (half_wilson_vector * src, double s,
+			half_wilson_vector * dest);
+void scalar_mult_add_wvec (wilson_vector * src1, wilson_vector * src2,
+			   double scalar, wilson_vector * dest);
+void scalar_mult_addtm_wvec (wilson_vector * src1, wilson_vector * src2,
+			     double scalar, wilson_vector * dest);
+void c_scalar_mult_wvec (wilson_vector * src1, complex * phase,
+			 wilson_vector * dest);
+void c_scalar_mult_add_wvec (wilson_vector * src1, wilson_vector * src2,
+			     complex * phase, wilson_vector * dest);
+void c_scalar_mult_add_wvec2 (wilson_vector * src1, wilson_vector * src2,
+			      complex s, wilson_vector * dest);
+void c_scalar_mult_su3vec (su3_vector * src, complex * phase,
+			   su3_vector * dest);
+void c_scalar_mult_add_su3vec (su3_vector * v1, complex * phase,
+			       su3_vector * v2);
+void c_scalar_mult_sub_su3vec (su3_vector * v1, complex * phase,
+			       su3_vector * v2);
+
+void left_su2_hit_n (su2_matrix * u, int p, int q, su3_matrix * link);
+void right_su2_hit_a (su2_matrix * u, int p, int q, su3_matrix * link);
+void dumpsu2 (su2_matrix * u);
+void mult_su2_mat_vec_elem_n (su2_matrix * u, complex * x0, complex * x1);
+void mult_su2_mat_vec_elem_a (su2_matrix * u, complex * x0, complex * x1);
+
+void mult_mat_wilson_vec (su3_matrix * mat, wilson_vector * src,
+			  wilson_vector * dest);
+void mult_su3_mat_hwvec (su3_matrix * mat, half_wilson_vector * src,
+			 half_wilson_vector * dest);
+void mult_adj_mat_wilson_vec (su3_matrix * mat, wilson_vector * src,
+			      wilson_vector * dest);
+void mult_adj_su3_mat_hwvec (su3_matrix * mat, half_wilson_vector * src,
+			     half_wilson_vector * dest);
+
+void add_wilson_vector (wilson_vector * src1, wilson_vector * src2,
+			wilson_vector * dest);
+void sub_wilson_vector (wilson_vector * src1, wilson_vector * src2,
+			wilson_vector * dest);
+double magsq_wvec (wilson_vector * src);
+complex wvec_dot (wilson_vector * src1, wilson_vector * src2);
+complex wvec2_dot (wilson_vector * src1, wilson_vector * src2);
+double wvec_rdot (wilson_vector * a, wilson_vector * b);
+
+void wp_shrink (wilson_vector * src, half_wilson_vector * dest,
+		int dir, int sign);
+void wp_shrink_4dir (wilson_vector * a, half_wilson_vector * b1,
+		     half_wilson_vector * b2, half_wilson_vector * b3,
+		     half_wilson_vector * b4, int sign);
+void wp_grow (half_wilson_vector * src, wilson_vector * dest,
+	      int dir, int sign);
+void wp_grow_add (half_wilson_vector * src, wilson_vector * dest,
+		  int dir, int sign);
+void grow_add_four_wvecs (wilson_vector * a, half_wilson_vector * b1,
+			  half_wilson_vector * b2, half_wilson_vector * b3,
+			  half_wilson_vector * b4, int sign, int sum);
+void mult_by_gamma (wilson_vector * src, wilson_vector * dest, int dir);
+void mult_by_gamma_left (wilson_matrix * src, wilson_matrix * dest, int dir);
+void mult_by_gamma_right (wilson_matrix * src, wilson_matrix * dest, int dir);
+void mult_swv_by_gamma_l (spin_wilson_vector * src, spin_wilson_vector * dest,
+			  int dir);
+void mult_swv_by_gamma_r (spin_wilson_vector * src, spin_wilson_vector * dest,
+			  int dir);
+void su3_projector_w (wilson_vector * a, wilson_vector * b, su3_matrix * c);
+void clear_wvec (wilson_vector * dest);
+void copy_wvec (wilson_vector * src, wilson_vector * dest);
+void dump_wilson_vec (wilson_vector * src);
+
+double gaussian_rand_no (double_prn * prn_pt);
+typedef int int32type;
+typedef unsigned int u_int32type;
+void byterevn (int32type w[], int n);
+
+void
+mult_adj_su3_mat_vec (su3_matrix * a, su3_vector * b, su3_vector * c)
+{
+  int i;
+  register double t, ar, ai, br, bi, cr, ci;
+  for (i = 0; i < 3; i++)
+    {
+      ar = a->e[0][i].real;
+      ai = a->e[0][i].imag;
+
+      br = b->c[0].real;
+      bi = b->c[0].imag;
+
+      cr = ar * br;
+      t = ai * bi;
+      cr += t;
+
+      ci = ar * bi;
+      t = ai * br;
+      ci -= t;
+
+      ar = a->e[1][i].real;
+      ai = a->e[1][i].imag;
+
+      br = b->c[1].real;
+      bi = b->c[1].imag;
+
+      t = ar * br;
+      cr += t;
+      t = ai * bi;
+      cr += t;
+
+      t = ar * bi;
+      ci += t;
+      t = ai * br;
+      ci -= t;
+
+      ar = a->e[2][i].real;
+      ai = a->e[2][i].imag;
+
+      br = b->c[2].real;
+      bi = b->c[2].imag;
+
+      t = ar * br;
+      cr += t;
+      t = ai * bi;
+      cr += t;
+
+      t = ar * bi;
+      ci += t;
+      t = ai * br;
+      ci -= t;
+
+      c->c[i].real = cr;
+      c->c[i].imag = ci;
+    }
+}
+
+/* { dg-final { scan-assembler-not "mtvsrd" } } */