From patchwork Sun Nov 17 01:11:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michael Meissner
X-Patchwork-Id: 2012416
Date: Sat, 16 Nov 2024 20:11:43 -0500
From: Michael Meissner
To: gcc-patches@gcc.gnu.org, Michael Meissner, Segher Boessenkool, Peter Bergner
Subject: [PATCH] PR target/117487 Add power9/power10 float to logical operations

I was answering an email from a co-worker and I pointed him to work I had done
in the Power8 era that optimizes the 32-bit float math library in glibc.  In
doing so, I discovered that on Power9 and later, this optimization no longer
takes place.

The glibc 32-bit floating point math functions have code that looks like:

        union u {
          float f;
          uint32_t u32;
        };

        float
        math_foo (float x, unsigned int mask)
        {
          union u arg;
          float x2;

          arg.f = x;
          arg.u32 &= mask;

          x2 = arg.f;
          /* ... */
        }

On power8 with the optimization it generates:

        xscvdpspn 0,1
        sldi 9,4,32
        mtvsrd 32,9
        xxland 1,0,32
        xscvspdpn 1,1

I.e., it converts the SFmode value to the memory format (instead of the DFmode
format that is used within the register), shifts the mask so that it lands in
the upper 32 bits of the vector register, and does an XXLAND (i.e. there is
only one direct move from GPR to vector register).  After doing this, it
converts the upper 32 bits back to DFmode.

If the XSCVSPDPN instruction took the value in the normal 32-bit scalar
position in a vector register, we wouldn't have needed the SLDI of the mask.

On power9/power10/power11 it currently generates:

        xscvdpspn 0,1
        mfvsrwz 2,0
        and 2,2,4
        mtvsrws 1,2
        xscvspdpn 1,1
        blr

I.e., convert to SFmode representation, move the value to a GPR, do an AND
operation, move the 32-bit value back with a splat, and then convert it back
to DFmode format.

With this patch, it now generates:

        xscvdpspn 0,1
        mtvsrwz 32,2
        xxland 32,0,32
        xxspltw 1,32,1
        xscvspdpn 1,1
        blr

I.e., convert to SFmode representation, move the mask to the vector register,
do the operation using XXLAND.
Splat the result to put it in the correct location, and then convert back to
DFmode.

I have built GCC with the patches in this patch set applied on both little and
big endian PowerPC systems and there were no regressions.  Can I apply this
patch to GCC 15?

2024-11-16  Michael Meissner

gcc/

        PR target/117487
        * config/rs6000/vsx.md (SFmode logical peephole): Update comments in
        the original code that supports power8.  Add a new define_peephole2 to
        do the optimization on power9/power10.

---
 gcc/config/rs6000/vsx.md | 142 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 137 insertions(+), 5 deletions(-)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 73f20a86e56..4dd44499a72 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -6280,7 +6280,7 @@ (define_constants
    (SFBOOL_MFVSR_A               3)             ;; move to gpr src
    (SFBOOL_BOOL_D                4)             ;; and/ior/xor dest
    (SFBOOL_BOOL_A1               5)             ;; and/ior/xor arg1
-   (SFBOOL_BOOL_A2               6)             ;; and/ior/xor arg1
+   (SFBOOL_BOOL_A2               6)             ;; and/ior/xor arg2
    (SFBOOL_SHL_D                 7)             ;; shift left dest
    (SFBOOL_SHL_A                 8)             ;; shift left arg
    (SFBOOL_MTVSR_D               9)             ;; move to vecter dest
@@ -6320,18 +6320,18 @@ (define_constants
 ;; GPR, and instead move the integer mask value to the vector register after a
 ;; shift and do the VSX logical operation.
 
-;; The insns for dealing with SFmode in GPR registers looks like:
+;; The insns for dealing with SFmode in GPR registers looks like on power8:
 ;; (set (reg:V4SF reg2) (unspec:V4SF [(reg:SF reg1)] UNSPEC_VSX_CVDPSPN))
 ;;
-;; (set (reg:DI reg3) (unspec:DI [(reg:V4SF reg2)] UNSPEC_P8V_RELOAD_FROM_VSX))
+;; (set (reg:DI reg3) (zero_extend:DI (reg:SI reg2)))
 ;;
-;; (set (reg:DI reg4) (and:DI (reg:DI reg3) (reg:DI reg3)))
+;; (set (reg:DI reg4) (and:SI (reg:SI reg3) (reg:SI mask)))
 ;;
 ;; (set (reg:DI reg5) (ashift:DI (reg:DI reg4) (const_int 32)))
 ;;
 ;; (set (reg:SF reg6) (unspec:SF [(reg:DI reg5)] UNSPEC_P8V_MTVSRD))
 ;;
-;; (set (reg:SF reg6) (unspec:SF [(reg:SF reg6)] UNSPEC_VSX_CVSPDPN))
+;; (set (reg:SF reg7) (unspec:SF [(reg:SF reg6)] UNSPEC_VSX_CVSPDPN))
 
 (define_peephole2
   [(match_scratch:DI SFBOOL_TMP_GPR "r")
@@ -6412,6 +6412,138 @@ (define_peephole2
   operands[SFBOOL_MTVSR_D_V4SF] = gen_rtx_REG (V4SFmode, regno_mtvsr_d);
 })
 
+;; Constants for SFbool optimization on power9/power10
+(define_constants
+  [(SFBOOL2_TMP_VSX_V4SI         0)     ;; vector temporary (V4SI)
+   (SFBOOL2_TMP_GPR_SI           1)     ;; GPR temporary (SI)
+   (SFBOOL2_MFVSR_D              2)     ;; move to gpr dest (DI)
+   (SFBOOL2_MFVSR_A              3)     ;; move to gpr src (SI)
+   (SFBOOL2_BOOL_D               4)     ;; and/ior/xor dest (SI)
+   (SFBOOL2_BOOL_A1              5)     ;; and/ior/xor arg1 (SI)
+   (SFBOOL2_BOOL_A2              6)     ;; and/ior/xor arg2 (SI)
+   (SFBOOL2_SPLAT_D              7)     ;; splat dest (V4SI)
+   (SFBOOL2_MTVSR_D              8)     ;; move/splat to VSX dest.
+   (SFBOOL2_MTVSR_A              9)     ;; move/splat to VSX arg.
+   (SFBOOL2_MFVSR_A_V4SI        10)     ;; MFVSR_A as V4SI
+   (SFBOOL2_MTVSR_D_V4SI        11)     ;; MTVSR_D as V4SI
+   (SFBOOL2_XXSPLTW             12)])   ;; 1 or 3 for XXSPLTW
+
+;; On power9/power10, the code is different because we have a splat 32-bit
+;; operation that does a direct move to the FPR/vector registers (MTVSRWS).
+;;
+;; The insns for dealing with SFmode in GPR registers looks like on
+;; power9/power10:
+;;
+;; (set (reg:V4SF reg2) (unspec:V4SF [(reg:SF reg1)] UNSPEC_VSX_CVDPSPN))
+;;
+;; (set (reg:DI reg3) (zero_extend:DI (reg:SI reg2)))
+;;
+;; (set (reg:SI reg4) (and:SI (reg:SI reg3) (reg:SI mask)))
+;;
+;; (set (reg:V4SI reg5) (vec_duplicate:V4SI (reg:SI reg4)))
+;;
+;; (set (reg:SF reg6) (unspec:SF [(reg:SF reg5)] UNSPEC_VSX_CVSPDPN))
+
+;; The VSX temporary needs to be an Altivec register in case we are trying to
+;; do and/ior/xor of -16..15 and we want to use VSPLTISW to load the constant.
+;;
+;; The GPR temporary is only used if we are trying to do a logical operation
+;; with a constant outside of the -16..15 range on a power9.  Otherwise, we can
+;; load the constant directly into the VSX temporary register.
+
+(define_peephole2
+  [(match_scratch:V4SI SFBOOL2_TMP_VSX_V4SI "v")
+   (match_scratch:SI SFBOOL2_TMP_GPR_SI "r")
+
+   ;; Zero_extend and direct move
+   (set (match_operand:DI SFBOOL2_MFVSR_D "int_reg_operand")
+        (zero_extend:DI
+         (match_operand:SI SFBOOL2_MFVSR_A "vsx_register_operand")))
+
+   ;; AND/IOR/XOR operation on int
+   (set (match_operand:SI SFBOOL2_BOOL_D "int_reg_operand")
+        (and_ior_xor:SI
+         (match_operand:SI SFBOOL2_BOOL_A1 "int_reg_operand")
+         (match_operand:SI SFBOOL2_BOOL_A2 "reg_or_cint_operand")))
+
+   ;; Splat sfbool result to vector register
+   (set (match_operand:V4SI SFBOOL2_SPLAT_D "vsx_register_operand")
+        (vec_duplicate:V4SI
+         (match_dup SFBOOL2_BOOL_D)))]
+
+  "TARGET_POWERPC64 && TARGET_P9_VECTOR
+   && REG_P (operands[SFBOOL2_MFVSR_D])
+   && REG_P (operands[SFBOOL2_BOOL_A1])
+   && (REGNO (operands[SFBOOL2_MFVSR_D]) == REGNO (operands[SFBOOL2_BOOL_A1])
+       || (REG_P (operands[SFBOOL2_BOOL_A2])
+           && (REGNO (operands[SFBOOL2_MFVSR_D])
+               == REGNO (operands[SFBOOL2_BOOL_A2]))))
+   && peep2_reg_dead_p (3, operands[SFBOOL2_MFVSR_D])
+   && peep2_reg_dead_p (4, operands[SFBOOL2_BOOL_D])"
+
+  ;; Either (set (reg:SI xxx) (reg:SI yyy)) or
+  ;; (set (reg:V4SI xxx) (const_vector (parallel [c, c, c, c])))
+  [(set (match_dup SFBOOL2_MTVSR_D)
+        (match_dup SFBOOL2_MTVSR_A))
+
+   ;; And/ior/xor on vector registers
+   (set (match_dup SFBOOL2_TMP_VSX_V4SI)
+        (and_ior_xor:V4SI
+         (match_dup SFBOOL2_MFVSR_A_V4SI)
+         (match_dup SFBOOL2_TMP_VSX_V4SI)))
+
+   ;; XXSPLTW t,r,r,1
+   (set (match_dup SFBOOL2_SPLAT_D)
+        (vec_duplicate:V4SI
+         (vec_select:SI
+          (match_dup SFBOOL2_TMP_VSX_V4SI)
+          (parallel [(match_dup SFBOOL2_XXSPLTW)]))))]
+{
+  rtx mfvsr_d = operands[SFBOOL2_MFVSR_D];
+  rtx bool_a1 = operands[SFBOOL2_BOOL_A1];
+  rtx bool_a2 = operands[SFBOOL2_BOOL_A2];
+  rtx bool_arg = (rtx_equal_p (mfvsr_d, bool_a1) ? bool_a2 : bool_a1);
+  int regno_mfvsr_a = REGNO (operands[SFBOOL2_MFVSR_A]);
+  int regno_tmp_vsx = REGNO (operands[SFBOOL2_TMP_VSX_V4SI]);
+
+  /* If the logical operation is a constant, form the constant in a vector
+     register.  */
+  if (CONST_INT_P (bool_arg))
+    {
+      HOST_WIDE_INT value = INTVAL (bool_arg);
+
+      /* See if we can directly load the constant, either by VSPLTIW or by
+         XXSPLTIW on power10.  */
+
+      if (IN_RANGE (value, -16, 15) || TARGET_PREFIXED)
+        {
+          rtvec cv = gen_rtvec (4, bool_arg, bool_arg, bool_arg, bool_arg);
+          operands[SFBOOL2_MTVSR_D] = gen_rtx_REG (V4SImode, regno_tmp_vsx);
+          operands[SFBOOL2_MTVSR_A] = gen_rtx_CONST_VECTOR (V4SImode, cv);
+        }
+
+      else
+        {
+          /* We need to load up the constant to a GPR and move it to a
+             vector register.  */
+          rtx tmp_gpr = operands[SFBOOL2_TMP_GPR_SI];
+          emit_move_insn (tmp_gpr, bool_arg);
+          operands[SFBOOL2_MTVSR_D] = gen_rtx_REG (SImode, regno_tmp_vsx);
+          operands[SFBOOL2_MTVSR_A] = tmp_gpr;
+        }
+    }
+  else
+    {
+      /* Mask is in a register, move it to a vector register.  */
+      operands[SFBOOL2_MTVSR_D] = gen_rtx_REG (SImode, regno_tmp_vsx);
+      operands[SFBOOL2_MTVSR_A] = bool_arg;
+    }
+
+  operands[SFBOOL2_TMP_VSX_V4SI] = gen_rtx_REG (V4SImode, regno_tmp_vsx);
+  operands[SFBOOL2_MFVSR_A_V4SI] = gen_rtx_REG (V4SImode, regno_mfvsr_a);
+  operands[SFBOOL2_XXSPLTW] = GEN_INT (BYTES_BIG_ENDIAN ? 1 : 2);
+})
+
 ;; Support signed/unsigned long long to float conversion vectorization.
 ;; Note that any_float (pc) here is just for code attribute .
 (define_expand "vec_pack_float_v2di"
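
For anyone who wants to look at the generated code themselves, the source
pattern that the peephole targets (a float whose bits are combined with an
integer mask and then used as a float again) can be reduced to a small
standalone function.  This is only a sketch under stated assumptions: the name
mask_float is made up for illustration, and it is neither a glibc function nor
a testcase from this patch.

```c
#include <stdint.h>

/* Hypothetical helper (not from glibc or from this patch): AND the bit
   pattern of a float with an integer mask and return the result as a float.
   This mirrors the union-based idiom described in the cover letter.  */
float
mask_float (float x, uint32_t mask)
{
  union { float f; uint32_t u32; } arg;

  arg.f = x;            /* view the float's bits as an integer */
  arg.u32 &= mask;      /* the and/ior/xor step the peephole looks for */
  return arg.f;         /* use the masked bits as a float again */
}
```

Compiling a file like this with, say, -O2 -mcpu=power9 and reading the
assembly is one way to check which of the sequences described above is
produced.  As a sanity check of the semantics, masking -1.5f with 0x7fffffff
clears the sign bit and yields 1.5f.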