From patchwork Mon Jul 8 09:49:07 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: MAHESH BODAPATI X-Patchwork-Id: 1957876 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=ibm.com header.i=@ibm.com header.a=rsa-sha256 header.s=pp1 header.b=d1bz+lh/; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org; envelope-from=libc-alpha-bounces~incoming=patchwork.ozlabs.org@sourceware.org; receiver=patchwork.ozlabs.org) Received: from server2.sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4WHfWM4Q7bz1xpd for ; Mon, 8 Jul 2024 19:50:39 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id AEEEB384F4B8 for ; Mon, 8 Jul 2024 09:50:37 +0000 (GMT) X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 6DAB03861009 for ; Mon, 8 Jul 2024 09:50:19 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 6DAB03861009 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linux.ibm.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 6DAB03861009 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=148.163.156.1 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720432221; cv=none; b=u3zpUQZSU82OkUwzdLUS0B7CkrxfBTn4keq57xKP0qQvNk+y3nVXtqMD8PNJ6yk4cmOOU6lfGK0TJP7WRqLys1N7arP3Lb0JdpAcYIIS3rNe/EglPl5Ca3kpt5/VT2nN7jzw2HKHHNIGnp5YqY3o7XML2pToOn/BJaKu30fgHWo= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720432221; c=relaxed/simple; bh=FNFssDB+A3uwnG25IOT2upDDfCrFNFjdrNnT5KxdJz0=; h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=jT5WJNsTbmyUdq9HSZHC+rl/xz7fn9APzcemkfVNKK/t3Znvw+yvRrxDnb265L5twczdrpGP2TwfVsCznwpex+dT4c+d1oBSd2vPedUYQLzfeouEQ/t0MXNDmCowZGfyhfM2YXZFhYMr2E+dRssqASta9Rdmsg5iL+E+XhgHdyY= ARC-Authentication-Results: i=1; server2.sourceware.org Received: from pps.filterd (m0356517.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.18.1.2/8.18.1.2) with ESMTP id 4684wRWA026684; Mon, 8 Jul 2024 09:50:17 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from :to:cc:subject:date:message-id:mime-version :content-transfer-encoding; s=pp1; bh=mrJ+32m76LqjjFr9rxNa55hQq+ ga4lXZHJGy0BFtVaI=; b=d1bz+lh/+IQBG6+Xd0k6L4j8aJiZb0oWmQiDEqBVIN 7MHx1H6/6ZOJPe0NxmaNB907zilD+MAvXWZ/zfJgb5c5roDgB6D0JDznw8EZn6Bd 6d0OB/5XQQEo8NRzcSbeLSsDHCb17mAyhtX5Gp9rUojcMgANDhaUgdY+r84UUWTz 8UG2I9lt3RRm16D3j1UYUAe03cnlMP4AbbPuf+096m99Dr7D4q5Xm5rr+idMgd7i uyoq5QiasX9dD17snOWD9MgVeOSJ8M9yeeRF35PZzrmI6IUWKWezXFJ3foysKu6n 8o5yDtzXRQZDaQbw0v3hKlTy+Zo0WxXFFBkibgkE04aA== Received: from ppma21.wdc07v.mail.ibm.com (5b.69.3da9.ip4.static.sl-reverse.com [169.61.105.91]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 407ux92033-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Jul 2024 09:50:17 +0000 (GMT) Received: from pps.filterd (ppma21.wdc07v.mail.ibm.com [127.0.0.1]) by ppma21.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 4688tGE8014081; Mon, 8 Jul 2024 09:50:16 GMT Received: from smtprelay01.fra02v.mail.ibm.com ([9.218.2.227]) by ppma21.wdc07v.mail.ibm.com (PPS) with ESMTPS id 407h8pebjj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 08 Jul 2024 09:50:15 +0000 Received: from smtpav06.fra02v.mail.ibm.com (smtpav06.fra02v.mail.ibm.com [10.20.54.105]) by smtprelay01.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 4689oAVM54002032 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 8 Jul 2024 09:50:12 GMT Received: from smtpav06.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0C3DD2005A; Mon, 8 Jul 2024 09:50:10 +0000 (GMT) Received: from smtpav06.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 20E7320065; Mon, 8 Jul 2024 09:50:09 +0000 (GMT) Received: from rain48p1.aus.stglabs.ibm.com (unknown [9.40.203.147]) by smtpav06.fra02v.mail.ibm.com (Postfix) with ESMTP; Mon, 8 Jul 2024 09:50:08 +0000 (GMT) From: MAHESH BODAPATI To: libc-alpha@sourceware.org Cc: bergner@linux.ibm.com, murphyp@linux.ibm.com, adhemerval.zanella@linaro.org, Mahesh Bodapati Subject: [PATCH v4] powerpc64: optimize strcpy and stpcpy for POWER9/10 Date: Mon, 8 Jul 2024 05:49:07 -0400 Message-ID: <20240708094906.1120118-2-bmahi496@linux.ibm.com> X-Mailer: git-send-email 2.43.5 MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: 7DTvsp8KKKyaRFtdThIeD2ji0jVIPKmO X-Proofpoint-GUID: 7DTvsp8KKKyaRFtdThIeD2ji0jVIPKmO X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1039,Hydra:6.0.680,FMLib:17.12.28.16 definitions=2024-07-08_04,2024-07-05_01,2024-05-17_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 phishscore=0 priorityscore=1501 lowpriorityscore=0 malwarescore=0 mlxscore=0 impostorscore=0 clxscore=1015 adultscore=0 mlxlogscore=927 suspectscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.19.0-2406140001 definitions=main-2407080073 X-Spam-Status: No, score=-11.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_NUMSUBJECT, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces~incoming=patchwork.ozlabs.org@sourceware.org From: Mahesh Bodapati This patch modifies the current POWER9 implementation of strcpy and stpcpy to optimize it for POWER9/10. Since no new POWER10 instructions are used, the original POWER9 strcpy is modified instead of creating a new implementation for POWER10. The changes also affect stpcpy, which uses the same implementation with some additional code before returning. Improvements compared to POWER9 version: Use simple comparisons for the first ~512 bytes The main loop is good for long strings, but comparing 16B each time is better for shorter strings. After aligning the address to 16 bytes, we unroll the loop four times, checking 128 bytes each time. There may be some overlap with the main loop for unaligned strings, but it is better for shorter strings. Loop with 64 bytes for longer bytes using 4 consecutive lxv/stxv instructions. Showed an average improvement of 13%. Reviewed-by: Paul E. Murphy --- Changes v3 -> v4: - Used a tab after the opcode. sysdeps/powerpc/powerpc64/le/power9/strcpy.S | 276 +++++++++++++++---- 1 file changed, 223 insertions(+), 53 deletions(-) diff --git a/sysdeps/powerpc/powerpc64/le/power9/strcpy.S b/sysdeps/powerpc/powerpc64/le/power9/strcpy.S index 603bde1e39..2f50625a19 100644 --- a/sysdeps/powerpc/powerpc64/le/power9/strcpy.S +++ b/sysdeps/powerpc/powerpc64/le/power9/strcpy.S @@ -42,22 +42,48 @@ if USE_AS_STPCPY is defined. - The implementation can load bytes past a null terminator, but only - up to the next 16B boundary, so it never crosses a page. */ + This implementation never reads across a page boundary, but may + read beyond the NUL terminator. */ -/* Load quadword at addr+offset to vreg, check for null bytes, +/* Load 4 quadwords, merge into one VR for speed and check for NUL + and branch to label if NUL is found. */ +#define CHECK_64B(offset,addr,label) \ + lxv 32+v4,(offset+0)(addr); \ + lxv 32+v5,(offset+16)(addr); \ + lxv 32+v6,(offset+32)(addr); \ + lxv 32+v7,(offset+48)(addr); \ + vminub v14,v4,v5; \ + vminub v15,v6,v7; \ + vminub v16,v14,v15; \ + vcmpequb. v0,v16,v18; \ + beq cr6,$+12; \ + li r7,offset; \ + b L(label); \ + stxv 32+v4,(offset+0)(r11); \ + stxv 32+v5,(offset+16)(r11); \ + stxv 32+v6,(offset+32)(r11); \ + stxv 32+v7,(offset+48)(r11) + +/* Load quadword at addr+offset to vreg, check for NUL bytes, and branch to label if any are found. */ -#define CHECK16(vreg,offset,addr,label) \ - lxv vreg+32,offset(addr); \ - vcmpequb. v6,vreg,v18; \ +#define CHECK_16B(vreg,offset,addr,label) \ + lxv vreg+32,offset(addr); \ + vcmpequb. v15,vreg,v18; \ bne cr6,L(label); -.machine power9 +/* Store vreg2 with length if NUL is found. */ +#define STORE_WITH_LEN(vreg1,vreg2,reg) \ + vctzlsbb r8,vreg1; \ + addi r9,r8,1; \ + sldi r9,r9,56; \ + stxvl 32+vreg2,reg,r9; + +.machine power9 ENTRY_TOCLESS (FUNC_NAME, 4) CALL_MCOUNT 2 - vspltisb v18,0 /* Zeroes in v18 */ - vspltisb v19,-1 /* 0xFF bytes in v19 */ + vspltisb v18,0 /* Zeroes in v18. */ + vspltisb v19,-1 /* 0xFF bytes in v19. */ /* Next 16B-aligned address. Prepare address for L(loop). */ addi r5,r4,16 @@ -70,14 +96,11 @@ ENTRY_TOCLESS (FUNC_NAME, 4) lvsr v1,0,r4 vperm v0,v19,v0,v1 - vcmpequb. v6,v0,v18 /* 0xff if byte is NULL, 0x00 otherwise */ + vcmpequb. v6,v0,v18 /* 0xff if byte is NUL, 0x00 otherwise. */ beq cr6,L(no_null) - /* There's a null byte. */ - vctzlsbb r8,v6 /* Number of trailing zeroes */ - addi r9,r8,1 /* Add null byte. */ - sldi r10,r9,56 /* stxvl wants size in top 8 bits. */ - stxvl 32+v0,r3,r10 /* Partial store */ + /* There's a NUL byte. */ + STORE_WITH_LEN(v6,v0,r3) #ifdef USE_AS_STPCPY /* stpcpy returns the dest address plus the size not counting the @@ -87,17 +110,22 @@ ENTRY_TOCLESS (FUNC_NAME, 4) blr L(no_null): - sldi r10,r8,56 /* stxvl wants size in top 8 bits */ - stxvl 32+v0,r3,r10 /* Partial store */ + sldi r10,r8,56 /* stxvl wants size in top 8 bits. */ + stxvl 32+v0,r3,r10 /* Partial store. */ +/* The main loop is optimized for longer strings(> 512 bytes), + so checking the first bytes in 16B chunks benefits shorter + strings a lot. */ .p2align 4 -L(loop): - CHECK16(v0,0,r5,tail1) - CHECK16(v1,16,r5,tail2) - CHECK16(v2,32,r5,tail3) - CHECK16(v3,48,r5,tail4) - CHECK16(v4,64,r5,tail5) - CHECK16(v5,80,r5,tail6) +L(aligned): + CHECK_16B(v0,0,r5,tail1) + CHECK_16B(v1,16,r5,tail2) + CHECK_16B(v2,32,r5,tail3) + CHECK_16B(v3,48,r5,tail4) + CHECK_16B(v4,64,r5,tail5) + CHECK_16B(v5,80,r5,tail6) + CHECK_16B(v6,96,r5,tail7) + CHECK_16B(v7,112,r5,tail8) stxv 32+v0,0(r11) stxv 32+v1,16(r11) @@ -105,21 +133,146 @@ L(loop): stxv 32+v3,48(r11) stxv 32+v4,64(r11) stxv 32+v5,80(r11) + stxv 32+v6,96(r11) + stxv 32+v7,112(r11) - addi r5,r5,96 - addi r11,r11,96 + addi r11,r11,128 + + CHECK_16B(v0,128,r5,tail1) + CHECK_16B(v1,128+16,r5,tail2) + CHECK_16B(v2,128+32,r5,tail3) + CHECK_16B(v3,128+48,r5,tail4) + CHECK_16B(v4,128+64,r5,tail5) + CHECK_16B(v5,128+80,r5,tail6) + CHECK_16B(v6,128+96,r5,tail7) + CHECK_16B(v7,128+112,r5,tail8) + + stxv 32+v0,0(r11) + stxv 32+v1,16(r11) + stxv 32+v2,32(r11) + stxv 32+v3,48(r11) + stxv 32+v4,64(r11) + stxv 32+v5,80(r11) + stxv 32+v6,96(r11) + stxv 32+v7,112(r11) + + addi r11,r11,128 + + CHECK_16B(v0,256,r5,tail1) + CHECK_16B(v1,256+16,r5,tail2) + CHECK_16B(v2,256+32,r5,tail3) + CHECK_16B(v3,256+48,r5,tail4) + CHECK_16B(v4,256+64,r5,tail5) + CHECK_16B(v5,256+80,r5,tail6) + CHECK_16B(v6,256+96,r5,tail7) + CHECK_16B(v7,256+112,r5,tail8) + + stxv 32+v0,0(r11) + stxv 32+v1,16(r11) + stxv 32+v2,32(r11) + stxv 32+v3,48(r11) + stxv 32+v4,64(r11) + stxv 32+v5,80(r11) + stxv 32+v6,96(r11) + stxv 32+v7,112(r11) + + addi r11,r11,128 + + CHECK_16B(v0,384,r5,tail1) + CHECK_16B(v1,384+16,r5,tail2) + CHECK_16B(v2,384+32,r5,tail3) + CHECK_16B(v3,384+48,r5,tail4) + CHECK_16B(v4,384+64,r5,tail5) + CHECK_16B(v5,384+80,r5,tail6) + CHECK_16B(v6,384+96,r5,tail7) + CHECK_16B(v7,384+112,r5,tail8) + + stxv 32+v0,0(r11) + stxv 32+v1,16(r11) + stxv 32+v2,32(r11) + stxv 32+v3,48(r11) + stxv 32+v4,64(r11) + stxv 32+v5,80(r11) + stxv 32+v6,96(r11) + stxv 32+v7,112(r11) + + /* Align src pointer down to a 64B boundary. */ + addi r5,r4,512 + clrrdi r5,r5,6 + subf r7,r4,r5 + add r11,r3,r7 + +/* Switch to a more aggressive approach checking 64B each time. */ + .p2align 5 +L(strcpy_loop): + CHECK_64B(0,r5,tail_64b) + CHECK_64B(64,r5,tail_64b) + CHECK_64B(128,r5,tail_64b) + CHECK_64B(192,r5,tail_64b) + + CHECK_64B(256,r5,tail_64b) + CHECK_64B(256+64,r5,tail_64b) + CHECK_64B(256+128,r5,tail_64b) + CHECK_64B(256+192,r5,tail_64b) + addi r5,r5,512 + addi r11,r11,512 + + b L(strcpy_loop) + + .p2align 5 +L(tail_64b): + /* OK, we found a NUL byte. Let's look for it in the current 64-byte + block and mark it in its corresponding VR. */ + add r11,r11,r7 + vcmpequb. v8,v4,v18 + beq cr6,L(no_null_16B) + /* There's a NUL byte. */ + STORE_WITH_LEN(v8,v4,r11) +#ifdef USE_AS_STPCPY + add r3,r11,r8 +#endif + blr + +L(no_null_16B): + stxv 32+v4,0(r11) + vcmpequb. v8,v5,v18 + beq cr6,L(no_null_32B) + /* There's a NUL byte. */ + addi r11,r11,16 + STORE_WITH_LEN(v8,v5,r11) +#ifdef USE_AS_STPCPY + add r3,r11,r8 +#endif + blr - b L(loop) +L(no_null_32B): + stxv 32+v5,16(r11) + vcmpequb. v8,v6,v18 + beq cr6,L(no_null_48B) + /* There's a NUL byte. */ + addi r11,r11,32 + STORE_WITH_LEN(v8,v6,r11) +#ifdef USE_AS_STPCPY + add r3,r11,r8 +#endif + blr + +L(no_null_48B): + stxv 32+v6,32(r11) + vcmpequb. v8,v7,v18; + /* There's a NUL byte. */ + addi r11,r11,48 + STORE_WITH_LEN(v8,v7,r11) +#ifdef USE_AS_STPCPY + add r3,r11,r8 +#endif + blr .p2align 4 L(tail1): - vctzlsbb r8,v6 /* Number of trailing zeroes */ - addi r9,r8,1 /* Add null terminator */ - sldi r9,r9,56 /* stxvl wants size in top 8 bits */ - stxvl 32+v0,r11,r9 /* Partial store */ + /* There's a NUL byte. */ + STORE_WITH_LEN(v15,v0,r11) #ifdef USE_AS_STPCPY - /* stpcpy returns the dest address plus the size not counting the - final '\0'. */ add r3,r11,r8 #endif blr @@ -127,11 +280,9 @@ L(tail1): .p2align 4 L(tail2): stxv 32+v0,0(r11) - vctzlsbb r8,v6 - addi r9,r8,1 - sldi r9,r9,56 + /* There's a NUL byte. */ addi r11,r11,16 - stxvl 32+v1,r11,r9 + STORE_WITH_LEN(v15,v1,r11) #ifdef USE_AS_STPCPY add r3,r11,r8 #endif @@ -141,11 +292,8 @@ L(tail2): L(tail3): stxv 32+v0,0(r11) stxv 32+v1,16(r11) - vctzlsbb r8,v6 - addi r9,r8,1 - sldi r9,r9,56 addi r11,r11,32 - stxvl 32+v2,r11,r9 + STORE_WITH_LEN(v15,v2,r11) #ifdef USE_AS_STPCPY add r3,r11,r8 #endif @@ -156,11 +304,8 @@ L(tail4): stxv 32+v0,0(r11) stxv 32+v1,16(r11) stxv 32+v2,32(r11) - vctzlsbb r8,v6 - addi r9,r8,1 - sldi r9,r9,56 addi r11,r11,48 - stxvl 32+v3,r11,r9 + STORE_WITH_LEN(v15,v3,r11) #ifdef USE_AS_STPCPY add r3,r11,r8 #endif @@ -172,11 +317,8 @@ L(tail5): stxv 32+v1,16(r11) stxv 32+v2,32(r11) stxv 32+v3,48(r11) - vctzlsbb r8,v6 - addi r9,r8,1 - sldi r9,r9,56 addi r11,r11,64 - stxvl 32+v4,r11,r9 + STORE_WITH_LEN(v15,v4,r11) #ifdef USE_AS_STPCPY add r3,r11,r8 #endif @@ -189,11 +331,39 @@ L(tail6): stxv 32+v2,32(r11) stxv 32+v3,48(r11) stxv 32+v4,64(r11) - vctzlsbb r8,v6 - addi r9,r8,1 - sldi r9,r9,56 addi r11,r11,80 - stxvl 32+v5,r11,r9 + STORE_WITH_LEN(v15,v5,r11) +#ifdef USE_AS_STPCPY + add r3,r11,r8 +#endif + blr + + .p2align 4 +L(tail7): + stxv 32+v0,0(r11) + stxv 32+v1,16(r11) + stxv 32+v2,32(r11) + stxv 32+v3,48(r11) + stxv 32+v4,64(r11) + stxv 32+v5,80(r11) + addi r11,r11,96 + STORE_WITH_LEN(v15,v6,r11) +#ifdef USE_AS_STPCPY + add r3,r11,r8 +#endif + blr + + .p2align 4 +L(tail8): + stxv 32+v0,0(r11) + stxv 32+v1,16(r11) + stxv 32+v2,32(r11) + stxv 32+v3,48(r11) + stxv 32+v4,64(r11) + stxv 32+v5,80(r11) + stxv 32+v6,96(r11) + addi r11,r11,112 + STORE_WITH_LEN(v15,v7,r11) #ifdef USE_AS_STPCPY add r3,r11,r8 #endif