| Message ID | DB5PR08MB103042C88641FD9D3A04C3F983C40@DB5PR08MB1030.eurprd08.prod.outlook.com |
|---|---|
| State | New |
| Series | [AArch64] Adjust writeback in non-zero memset |
v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.

This fixes an inefficiency in the non-zero memset. Delaying the writeback until the end of the loop is slightly faster on some cores - this shows ~5% performance gain on Cortex-A53 when doing large non-zero memsets.

Tested against the GLIBC testsuite, OK for commit?

ChangeLog:
2018-11-14  Wilco Dijkstra  <wdijkstr@arm.com>

	* sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.

---

```diff
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..9738cf5fd55a1d937fb3392cec46f37b4d5fb51d 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@ L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16	/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b
@@ -183,6 +183,7 @@ L(zva_other):
 	subs	count, count, zva_len
 	b.hs	3b
 4:	add	count, count, zva_len
+	sub	dst, dst, 32	/* Bias dst for tail loop.  */
 	b	L(tail64)
 #endif
```
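The loop transformation in the patch can be illustrated with a hypothetical C analogue. The function names and structure below are purely illustrative (not glibc code), and the real assembly's alignment handling is ignored: both variants simply fill `[dst, dstend)` in 64-byte chunks, with the size assumed to be a multiple of 64 for this sketch. The "before" variant increments the pointer early, forcing a negative offset on the second store; the "after" variant biases the pointer by -32 up front so the increment can be folded into the last store of each iteration.

```c
#include <string.h>

/* Before the patch: the writeback ("dst += 64") happens with the first
   store, so the second store needs a negative offset.  */
static void fill64_early_writeback(unsigned char *dst, unsigned char *dstend)
{
    while (dst < dstend) {
        memset(dst, 0x5A, 32);       /* stp q0, q0, [dst], 64  */
        dst += 64;
        memset(dst - 32, 0x5A, 32);  /* stp q0, q0, [dst, -32] */
    }
}

/* After the patch: dst is biased by -32 before the loop and the
   writeback is delayed to the second store of each iteration.  */
static void fill64_late_writeback(unsigned char *dst, unsigned char *dstend)
{
    dst -= 32;                       /* bias dst by -32 up front */
    while (dst + 32 < dstend) {
        memset(dst + 32, 0x5A, 32);  /* stp q0, q0, [dst, 32]  */
        memset(dst + 64, 0x5A, 32);  /* stp q0, q0, [dst, 64]! */
        dst += 64;                   /* the "!" writeback, done last */
    }
}
```

Both variants store exactly the same bytes; the only difference is where in the iteration the pointer update is issued, which is what matters on in-order cores like Cortex-A53.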
On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>
> This fixes an inefficiency in the non-zero memset. Delaying the writeback
> until the end of the loop is slightly faster on some cores - this shows
> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>
> Tested against the GLIBC testsuite, OK for commit?

Can you please also summarize the performance results for other processors?

Thanks,
Siddhesh
Hi Siddhesh,

> On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
>> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>>
>> This fixes an inefficiency in the non-zero memset. Delaying the writeback
>> until the end of the loop is slightly faster on some cores - this shows
>> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>>
>> Tested against the GLIBC testsuite, OK for commit?
>
> Can you please also summarize the performance results for other processors?

I ran it on Cortex-A72, and there isn't a performance difference on the memset bench beyond measurement errors. On any out-of-order core, address increments are executed in parallel with loads/stores, i.e. they have no measurable latency, so their exact placement is irrelevant.

Wilco
On 16/11/18 10:34 PM, Wilco Dijkstra wrote:
> I ran it on Cortex-A72, and there isn't a performance difference on the memset bench
> beyond measurement errors. On any out-of-order core, address increments are
> executed in parallel with loads/stores, i.e. they have no measurable latency, so
> their exact placement is irrelevant.

Thanks for the confirmation. Looks OK to me.

Siddhesh
```diff
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..2eefc62fc1eeccf736f627a7adfe5485aff9bca9 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@ L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16	/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b
```