Message ID: AM5PR0802MB26106339AAEF3DABB5ACE56F83D80@AM5PR0802MB2610.eurprd08.prod.outlook.com
State: New
On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
> Sebastian Pop wrote:
>> If I remove all the alignment code, I get less performance on the
>> hikey A53 board. With this patch:
>
> @@ -142,9 +143,23 @@ ENTRY(memcmp)
>
>  	.p2align 6
>  .Lmisaligned8:
> +
> +	cmp	limit, #8
> +	b.lo	.LmisalignedLt8
> +
> +	.p2align 5
> +.Lloop_part_aligned:
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +	subs	limit_wd, limit_wd, #1
> +.Lstart_part_realigned:
> +	eor	diff, data1, data2	/* Non-zero if differences found.  */
> +	cbnz	diff, .Lnot_limit
> +	b.ne	.Lloop_part_aligned
> +
> +.LmisalignedLt8:
>  	sub	limit, limit, #1
>  1:
> -	/* Perhaps we can do better than this.  */
>  	ldrb	data1w, [src1], #1
>  	ldrb	data2w, [src2], #1
>  	subs	limit, limit, #1
>
> Where is the setup of limit_wd and limit???

You are right, my patch was not quite correct: I was missing the
initialization of limit_wd, like so:

    lsr limit_wd, limit, #3

limit is the number of bytes to be compared, passed in as a parameter
to memcmp. With this extra statement I am still seeing the same low
performance on the A53 hikey:

Benchmark                           Time      CPU  Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8      345 ns   345 ns     2026483  22.0879MB/s
BM_string_memcmp_unaligned/16     539 ns   539 ns     1298687  28.3159MB/s
BM_string_memcmp_unaligned/20     613 ns   613 ns     1142076  31.1222MB/s
BM_string_memcmp_unaligned/30     794 ns   794 ns      881596  36.0357MB/s
BM_string_memcmp_unaligned/42     957 ns   957 ns      731746  41.8753MB/s
BM_string_memcmp_unaligned/55    1208 ns  1207 ns      579591  43.4525MB/s
BM_string_memcmp_unaligned/60    1231 ns  1231 ns      568372  46.4756MB/s
BM_string_memcmp_unaligned/64    1312 ns  1312 ns      475862  46.5316MB/s

The base is with no patch applied to memcmp.S (byte-by-byte memcmp):

Benchmark                           Time      CPU  Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8      339 ns   339 ns     2066820  22.5274MB/s
BM_string_memcmp_unaligned/16     536 ns   536 ns     1306265  28.4901MB/s
BM_string_memcmp_unaligned/20     612 ns   612 ns     1146573  31.1479MB/s
BM_string_memcmp_unaligned/30     789 ns   789 ns      886755  36.2472MB/s
BM_string_memcmp_unaligned/42    1009 ns  1009 ns      693760  39.7MB/s
BM_string_memcmp_unaligned/55    1233 ns  1233 ns      567719  42.5469MB/s
BM_string_memcmp_unaligned/60    1322 ns  1322 ns      529511  43.2804MB/s
BM_string_memcmp_unaligned/64    1392 ns  1392 ns      502817  43.8426MB/s

And with the patch submitted for review, without computing max and
aligning on src1:

Benchmark                           Time      CPU  Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8      282 ns   282 ns     2482713  27.061MB/s
BM_string_memcmp_unaligned/16     304 ns   304 ns     2300275  50.1401MB/s
BM_string_memcmp_unaligned/20     322 ns   322 ns     2176437  59.2469MB/s
BM_string_memcmp_unaligned/30     352 ns   352 ns     1988315  81.328MB/s
BM_string_memcmp_unaligned/42     412 ns   412 ns     1699818  97.317MB/s
BM_string_memcmp_unaligned/55     503 ns   503 ns     1393029  104.382MB/s
BM_string_memcmp_unaligned/60     522 ns   522 ns     1340682  109.619MB/s
BM_string_memcmp_unaligned/64     541 ns   541 ns     1297637  112.891MB/s
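[Editor's note: for readers not fluent in AArch64 asm, the word loop being discussed corresponds roughly to the following portable C sketch. The name memcmp_words is mine; memcpy stands in for the asm's unaligned ldr, and the limit bookkeeping includes the setup that the follow-up messages point out was missing from the posted patch.]

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Rough C equivalent of the word-at-a-time loop above: compare 8 bytes
 * per iteration via XOR, then finish the remainder byte by byte.  */
static int memcmp_words(const unsigned char *s1, const unsigned char *s2,
                        size_t limit)
{
    size_t limit_wd = limit >> 3;      /* lsr  limit_wd, limit, #3 */
    while (limit_wd--) {
        uint64_t data1, data2;
        memcpy(&data1, s1, 8);         /* unaligned 8-byte loads */
        memcpy(&data2, s2, 8);
        if (data1 ^ data2)             /* eor  diff, data1, data2 */
            break;                     /* differing word: rescan its bytes */
        s1 += 8;
        s2 += 8;
        limit -= 8;                    /* keeps the byte tail in bounds */
    }
    while (limit--) {                  /* byte-by-byte tail */
        int d = (int)*s1++ - (int)*s2++;
        if (d != 0)
            return d;
    }
    return 0;
}
```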
Sebastian Pop wrote:
> On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
>
>> Where is the setup of limit_wd and limit???
>
> You are right, my patch was not quite correct: I was missing the
> initialization of limit_wd, like so:
>
>     lsr limit_wd, limit, #3
>
> limit is the number of bytes to be compared passed in as a parameter
> to memcmp.

You're still missing the setting of limit. Your current version will do
the words up to limit - (limit & 7), and then do byte by byte using the
original value of limit, so it's going well outside its bounds...

Wilco
On Mon, Jun 26, 2017 at 2:00 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Sebastian Pop wrote:
>> On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
>>
>>> Where is the setup of limit_wd and limit???
>>
>> You are right, my patch was not quite correct: I was missing the
>> initialization of limit_wd, like so:
>>
>>     lsr limit_wd, limit, #3
>>
>> limit is the number of bytes to be compared passed in as a parameter
>> to memcmp.
>
> You're still missing the setting of limit. Your current version will do
> the words up to limit - (limit & 7), and then do byte by byte using the
> original value of limit, so it's going well outside its bounds...

You are right, I was missing the "and limit, limit, #7". With that
added, the performance looks different from the byte-by-byte memcmp.
Still the performance is lower than when aligning src1:

Benchmark                           Time      CPU  Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8     1288 ns  1288 ns      540945  5.92208MB/s
BM_string_memcmp_unaligned/16    1303 ns  1303 ns      537143  11.7123MB/s
BM_string_memcmp_unaligned/20     341 ns   341 ns     2064228  55.9994MB/s
BM_string_memcmp_unaligned/30     405 ns   405 ns     1726750  70.5799MB/s
BM_string_memcmp_unaligned/42     405 ns   405 ns     1728170  98.8833MB/s
BM_string_memcmp_unaligned/55     563 ns   562 ns     1239350  93.2568MB/s
BM_string_memcmp_unaligned/60     539 ns   539 ns     1298194  106.109MB/s
BM_string_memcmp_unaligned/64    2378 ns  2378 ns      359461  25.6695MB/s

And for larger data sets the performance is still lower than when
aligning src1:

Benchmark                             Time       CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8       1288 ns   1288 ns      543230  5.9221MB/s
BM_string_memcmp_unaligned/64      2377 ns   2377 ns      359351  25.6742MB/s
BM_string_memcmp_unaligned/512     6444 ns   6444 ns      184103  75.7774MB/s
BM_string_memcmp_unaligned/1024    4869 ns   4868 ns      143785  200.599MB/s
BM_string_memcmp_unaligned/8k     33090 ns  33089 ns       21279  236.107MB/s
BM_string_memcmp_unaligned/16k    66748 ns  66738 ns       10436  234.123MB/s
BM_string_memcmp_unaligned/32k   131781 ns 131775 ns        5106  237.147MB/s
BM_string_memcmp_unaligned/64k   291907 ns 291860 ns        2334  214.143MB/s
Sebastian Pop wrote:
>
> And for larger data sets the performance is still lower than when
> aligning src1:
>
> Benchmark                             Time       CPU  Iterations
> ----------------------------------------------------------------------
> BM_string_memcmp_unaligned/8       1288 ns   1288 ns      543230  5.9221MB/s
> BM_string_memcmp_unaligned/64      2377 ns   2377 ns      359351  25.6742MB/s
> BM_string_memcmp_unaligned/512     6444 ns   6444 ns      184103  75.7774MB/s
> BM_string_memcmp_unaligned/1024    4869 ns   4868 ns      143785  200.599MB/s
> BM_string_memcmp_unaligned/8k     33090 ns  33089 ns       21279  236.107MB/s
> BM_string_memcmp_unaligned/16k    66748 ns  66738 ns       10436  234.123MB/s
> BM_string_memcmp_unaligned/32k   131781 ns 131775 ns        5106  237.147MB/s
> BM_string_memcmp_unaligned/64k   291907 ns 291860 ns        2334  214.143MB/s

These numbers still don't make any sense; the first few results are now
many times slower than the byte-by-byte version (as in your initial mail):

BM_string_memcmp_unaligned/8        339 ns    339 ns     2070998  22.5302MB/s
BM_string_memcmp_unaligned/64      1392 ns   1392 ns      502796  43.8454MB/s
BM_string_memcmp_unaligned/512     9194 ns   9194 ns       76133  53.1104MB/s
BM_string_memcmp_unaligned/1024   18325 ns  18323 ns       38206  53.2963MB/s
BM_string_memcmp_unaligned/8k    148579 ns 148574 ns        4713  52.5831MB/s
BM_string_memcmp_unaligned/16k   298169 ns 298120 ns        2344  52.4118MB/s
BM_string_memcmp_unaligned/32k   598813 ns 598797 ns        1085  52.188MB/s
BM_string_memcmp_unaligned/64k  1196079 ns 1196083 ns        540  52.2539MB/s

Wilco
Hi,

So I had a look at it using the GLIBC bench-memcmp.c. I quickly got the
unaligned loop to go faster than the aligned one using ccmp, so I had to
tune the unaligned loop too... This is using a similar trick as your
byte loop to remove a branch and 1-2 ALU operations per iteration.

This gives a 24% speedup on both Cortex-A53 and Cortex-A72 for the
aligned loop, and about 18% for the unaligned loop on top of your
patch. Aligning either src1 or src2 appears best as there isn't enough
work in the loops to hide an unaligned access.

Wilco
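[Editor's note: the branch-removal idea can be pictured in C as folding the two loop-exit tests into a single condition; on AArch64 such a fused test can be evaluated with a subs + ccmp pair and one conditional branch instead of two separate branches per iteration. This is an illustrative sketch with my own names, not the actual GLIBC loop.]

```c
#include <stdint.h>
#include <stddef.h>

/* Both exit conditions ("words exhausted" and "difference found") are
 * tested in one combined loop condition.  Returns 1 and stores the
 * XOR of the differing words if a mismatch was found, else 0.  */
static int cmp_words(const uint64_t *w1, const uint64_t *w2, size_t words,
                     uint64_t *diff_out)
{
    uint64_t diff = 0;
    while (words != 0 && diff == 0) {  /* single fused exit test */
        diff = *w1++ ^ *w2++;
        words--;
    }
    *diff_out = diff;
    return diff != 0;
}
```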
On Tue, Jun 27, 2017 at 10:11 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Hi,
>
> So I had a look at it using the GLIBC bench-memcmp.c. I quickly got the
> unaligned loop to go faster than the aligned one using ccmp, so I had to
> tune the unaligned loop too... This is using a similar trick as your byte
> loop to remove a branch and 1-2 ALU operations per iteration.
>
> This gives a 24% speedup on both Cortex-A53 and Cortex-A72 for
> the aligned loop, and about 18% for the unaligned loop on top of your
> patch. Aligning either src1 or src2 appears best as there isn't enough
> work in the loops to hide an unaligned access.

This looks good. Could you please send a patch?

Thanks,
Sebastian
Sebastian Pop <sebpop@gmail.com> wrote:
>
> This looks good. Could you please send a patch?

This is the newlib version:
https://sourceware.org/ml/newlib/2017/msg00524.html -
maybe you could run your benchmark to verify it's faster?

Wilco
Hi Wilco,

On Thu, Jun 29, 2017 at 9:47 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> This is the newlib version:
> https://sourceware.org/ml/newlib/2017/msg00524.html -
> maybe you could run your benchmark to verify it's faster?

Yes, your patch is faster than the patch that I submitted. Here are
the numbers from the bionic-benchmarks with your patch:

Benchmark                             Time       CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                   27 ns     27 ns    26198166  286.069MB/s
BM_string_memcmp/64                  45 ns     45 ns    15553753  1.32443GB/s
BM_string_memcmp/512                242 ns    242 ns     2892423  1.97049GB/s
BM_string_memcmp/1024               455 ns    455 ns     1537290  2.09436GB/s
BM_string_memcmp/8k                3446 ns   3446 ns      203295  2.21392GB/s
BM_string_memcmp/16k               7567 ns   7567 ns       92582  2.01657GB/s
BM_string_memcmp/32k              16081 ns  16081 ns       43524  1.8977GB/s
BM_string_memcmp/64k              31029 ns  31028 ns       22565  1.96712GB/s
BM_string_memcmp_aligned/8          184 ns    184 ns     3800912  41.3654MB/s
BM_string_memcmp_aligned/64         287 ns    287 ns     2438835  212.65MB/s
BM_string_memcmp_aligned/512       1370 ns   1370 ns      511014  356.498MB/s
BM_string_memcmp_aligned/1024      2543 ns   2543 ns      275253  384.006MB/s
BM_string_memcmp_aligned/8k       20413 ns  20411 ns       34306  382.764MB/s
BM_string_memcmp_aligned/16k      42908 ns  42907 ns       16132  364.158MB/s
BM_string_memcmp_aligned/32k      88902 ns  88886 ns        8087  351.574MB/s
BM_string_memcmp_aligned/64k     173016 ns 173007 ns        4122  361.258MB/s
BM_string_memcmp_unaligned/8        212 ns    212 ns     3304163  36.0243MB/s
BM_string_memcmp_unaligned/64       361 ns    361 ns     1941597  169.279MB/s
BM_string_memcmp_unaligned/512     1754 ns   1753 ns      399210  278.492MB/s
BM_string_memcmp_unaligned/1024    3308 ns   3308 ns      211622  295.243MB/s
BM_string_memcmp_unaligned/8k     27227 ns  27225 ns       25637  286.964MB/s
BM_string_memcmp_unaligned/16k    55877 ns  55874 ns       12455  279.645MB/s
BM_string_memcmp_unaligned/32k   112397 ns 112366 ns        6200  278.11MB/s
BM_string_memcmp_unaligned/64k   223493 ns 223482 ns        3127  279.665MB/s

And here are the numbers for the base (without your patch):

Benchmark                             Time       CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                   30 ns     30 ns    23092418  251.611MB/s
BM_string_memcmp/64                  58 ns     58 ns    12351397  1046.3MB/s
BM_string_memcmp/512                305 ns    305 ns     2297189  1.56479GB/s
BM_string_memcmp/1024               571 ns    571 ns     1225040  1.66906GB/s
BM_string_memcmp/8k                4307 ns   4306 ns      162561  1.77175GB/s
BM_string_memcmp/16k               9429 ns   9429 ns       74212  1.61826GB/s
BM_string_memcmp/32k              19166 ns  19164 ns       36521  1.59247GB/s
BM_string_memcmp/64k              37035 ns  37033 ns       18924  1.64811GB/s
BM_string_memcmp_aligned/8          199 ns    199 ns     3514227  38.3337MB/s
BM_string_memcmp_aligned/64         386 ns    386 ns     1811517  158.015MB/s
BM_string_memcmp_aligned/512       1731 ns   1730 ns      403250  282.163MB/s
BM_string_memcmp_aligned/1024      3198 ns   3198 ns      218758  305.354MB/s
BM_string_memcmp_aligned/8k       25217 ns  25214 ns       27483  309.85MB/s
BM_string_memcmp_aligned/16k      52015 ns  52013 ns       13527  300.407MB/s
BM_string_memcmp_aligned/32k     105034 ns 105034 ns        6689  297.522MB/s
BM_string_memcmp_aligned/64k     208464 ns 208465 ns        3427  299.81MB/s
BM_string_memcmp_unaligned/8        339 ns    339 ns     2066772  22.537MB/s
BM_string_memcmp_unaligned/64      1392 ns   1392 ns      502754  43.8398MB/s
BM_string_memcmp_unaligned/512     9194 ns   9194 ns       76133  53.1087MB/s
BM_string_memcmp_unaligned/1024   18326 ns  18324 ns       38209  53.2938MB/s
BM_string_memcmp_unaligned/8k    148591 ns 148585 ns        4713  52.5793MB/s
BM_string_memcmp_unaligned/16k   298256 ns 298207 ns        2343  52.3964MB/s
BM_string_memcmp_unaligned/32k   598855 ns 598828 ns        1085  52.1853MB/s
BM_string_memcmp_unaligned/64k  1196350 ns 1196355 ns        539  52.242MB/s

Thanks,
Sebastian
On Thu, Jun 29, 2017 at 12:09 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> Hi Wilco,
>
> On Thu, Jun 29, 2017 at 9:47 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> This is the newlib version:
>> https://sourceware.org/ml/newlib/2017/msg00524.html -
>> maybe you could run your benchmark to verify it's faster?
>
> Yes, your patch is faster than the patch that I submitted.

Would you like me to submit your patch to Bionic?

Thanks,
Sebastian
On Thu, Jun 29, 2017 at 12:15 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> On Thu, Jun 29, 2017 at 12:09 PM, Sebastian Pop <sebpop@gmail.com> wrote:
>> Hi Wilco,
>>
>> On Thu, Jun 29, 2017 at 9:47 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> This is the newlib version:
>>> https://sourceware.org/ml/newlib/2017/msg00524.html -
>>> maybe you could run your benchmark to verify it's faster?
>>
>> Yes, your patch is faster than the patch that I submitted.
>
> Would you like me to submit your patch to Bionic?

I pushed your change as an update to the patch that I already
submitted for review in Bionic:
https://android-review.googlesource.com/418279

Thanks,
Sebastian
Sebastian Pop wrote:
>
> I pushed your change as an update to the patch that I already
> submitted for review in Bionic:
> https://android-review.googlesource.com/418279

Thanks, I see it has already been committed to newlib, so I'll post the
GLIBC patch soon. There are probably some further tweaks possible,
e.g. using wider accesses and using unaligned accesses in the < 8 byte
case, but that requires a bit more work, so not for this GLIBC release...

Wilco
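[Editor's note: the "unaligned accesses in the < 8 byte case" tweak is, as I understand it, the standard overlapping-loads trick; a hedged sketch for the 4..8-byte case, with my own naming:]

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Use two (possibly overlapping) 4-byte loads for lengths 4..8
 * instead of a byte loop.  Together the loads cover exactly len
 * bytes, so nothing is read out of bounds.  The final bytewise
 * rescan keeps the sketch endian-independent.  */
static int memcmp_4to8(const unsigned char *s1, const unsigned char *s2,
                       size_t len)    /* precondition: 4 <= len <= 8 */
{
    uint32_t a1, a2, b1, b2;
    memcpy(&a1, s1, 4);
    memcpy(&a2, s2, 4);
    memcpy(&b1, s1 + len - 4, 4);     /* overlaps the first load if len < 8 */
    memcpy(&b2, s2 + len - 4, 4);
    if (a1 != a2 || b1 != b2) {
        for (size_t i = 0; i < len; i++)
            if (s1[i] != s2[i])
                return s1[i] < s2[i] ? -1 : 1;
    }
    return 0;
}
```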
--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)

 	/* Sources are not aligned: align one of the sources,
 	   find max offset from aligned boundary.  */
-	and	tmp1, src1, #0x7
+	and	tmp1, src2, #0x7
 	orr	tmp3, xzr, #0x8
 	sub	pos, tmp3, tmp1
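[Editor's note: the two changed instructions compute how many bytes to step so that the chosen source (src2 after the fix, rather than src1) becomes 8-byte aligned. In C terms, with my own naming for illustration:]

```c
#include <stdint.h>

/* pos = 8 - (src & 7): the byte count that brings src to the next
 * 8-byte boundary (8 when src is already aligned).  */
static unsigned bytes_to_align(uintptr_t src)
{
    unsigned tmp1 = (unsigned)(src & 0x7); /* and tmp1, src2, #0x7 */
    return 8u - tmp1;                      /* orr tmp3, xzr, #0x8
                                              sub pos, tmp3, tmp1 */
}
```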