Message ID | 1512105982-17496-1-git-send-email-siddhesh@sourceware.org |
---|---|
State | New |
Series | aarch64: Improve strcmp unaligned performance |
On Thu, Nov 30, 2017 at 9:26 PM, Siddhesh Poyarekar <siddhesh@sourceware.org> wrote:
> Replace the simple byte-wise compare in the misaligned case with a
> dword compare with page boundary checks in place. For simplicity I've
> chosen a 4K page boundary so that we don't have to query the actual
> page size on the system.
>
> This results in up to 3x improvement in performance in the unaligned
> case on falkor and about 2.5x improvement on mustang as measured using
> bench-strcmp.

On at least Cavium OcteonTX T81 and T83 (and ThunderX1 T88) this might
be slower, as any unaligned load or store has a 9-cycle penalty. I will
double-check this next week. What this means is that I will propose a
separate strcmp version for OcteonTX and ThunderX1 on top of this patch
if I find it is slower.

Thanks,
Andrew Pinski

> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
> 	time whenever possible.
> ---
>  sysdeps/aarch64/strcmp.S | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> [...]
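The 4K page-boundary guard described in the commit message above can be modeled in a few lines of C. This is an illustrative sketch only, not glibc code, and the function name is mine; it mirrors the patch's `and`/`eor`/`cbz` test, which looks only at address bits [11:3] and therefore falls back to byte-wise compares whenever SRC2 points into the last dword of a 4K page.

```c
#include <stdint.h>

/* Sketch of the page-boundary guard (name is mine, not glibc's).
   The dword loop only loads 8 bytes from SRC2 when this returns 0,
   i.e. when SRC2 is not within the last 8 bytes of a 4K page, so the
   load cannot touch the following page.  */
static int
src2_near_4k_page_end (const char *src2)
{
  return ((uintptr_t) src2 & 0xff8) == 0xff8;
}
```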
Ping!

On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
> Replace the simple byte-wise compare in the misaligned case with a
> dword compare with page boundary checks in place. For simplicity I've
> chosen a 4K page boundary so that we don't have to query the actual
> page size on the system.
>
> This results in up to 3x improvement in performance in the unaligned
> case on falkor and about 2.5x improvement on mustang as measured using
> bench-strcmp.
>
> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
> 	time whenever possible.
> [...]
On Friday 01 December 2017 12:25 PM, Andrew Pinski wrote:
> On at least Cavium OcteonTX T81 and T83 (and ThunderX1 T88) this might
> be slower, as any unaligned load or store has a 9-cycle penalty. I
> will double-check this next week. What this means is that I will
> propose a separate strcmp version for OcteonTX and ThunderX1 on top of
> this patch if I find it is slower.

Hi Andrew, were you able to test this?

Siddhesh
Ping!

On Monday 04 December 2017 09:37 PM, Siddhesh Poyarekar wrote:
> Ping!
>
> On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
>> Replace the simple byte-wise compare in the misaligned case with a
>> dword compare with page boundary checks in place. For simplicity I've
>> chosen a 4K page boundary so that we don't have to query the actual
>> page size on the system.
>>
>> This results in up to 3x improvement in performance in the unaligned
>> case on falkor and about 2.5x improvement on mustang as measured using
>> bench-strcmp.
>>
>> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
>> 	time whenever possible.
>> [...]
On 12/12/17 18:20, Siddhesh Poyarekar wrote:
> Ping!
>
> On Monday 04 December 2017 09:37 PM, Siddhesh Poyarekar wrote:
>> Ping!
>>
>> On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
>>> Replace the simple byte-wise compare in the misaligned case with a
>>> dword compare with page boundary checks in place. For simplicity I've
>>> chosen a 4K page boundary so that we don't have to query the actual
>>> page size on the system.
>>>
>>> This results in up to 3x improvement in performance in the unaligned
>>> case on falkor and about 2.5x improvement on mustang as measured using
>>> bench-strcmp.
>>>
>>> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
>>> 	time whenever possible.

Sorry for the delay, this is OK to commit. I believe it should give a
performance improvement in general.
```diff
diff --git a/sysdeps/aarch64/strcmp.S b/sysdeps/aarch64/strcmp.S
index e99d662..c260e1d 100644
--- a/sysdeps/aarch64/strcmp.S
+++ b/sysdeps/aarch64/strcmp.S
@@ -72,6 +72,7 @@ L(start_realigned):
 	cbz	syndrome, L(loop_aligned)
 	/* End of performance-critical section -- one 64B cache line. */
 
+L(end):
 #ifndef	__AARCH64EB__
 	rev	syndrome, syndrome
 	rev	data1, data1
@@ -145,12 +146,38 @@ L(mutual_align):
 	b	L(start_realigned)
 
 L(misaligned8):
-	/* We can do better than this. */
+	/* Align SRC1 to 8 bytes and then compare 8 bytes at a time, always
+	   checking to make sure that we don't access beyond page boundary in
+	   SRC2. */
+	tst	src1, #7
+	b.eq	L(loop_misaligned)
+L(do_misaligned):
 	ldrb	data1w, [src1], #1
 	ldrb	data2w, [src2], #1
 	cmp	data1w, #1
 	ccmp	data1w, data2w, #0, cs	/* NZCV = 0b0000. */
-	b.eq	L(misaligned8)
+	b.ne	L(done)
+	tst	src1, #7
+	b.ne	L(misaligned8)
+
+L(loop_misaligned):
+	/* Test if we are within the last dword of the end of a 4K page. If
+	   yes then jump back to the misaligned loop to copy a byte at a time. */
+	and	tmp1, src2, #0xff8
+	eor	tmp1, tmp1, #0xff8
+	cbz	tmp1, L(do_misaligned)
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2	/* Non-zero if differences found. */
+	bic	has_nul, tmp1, tmp2	/* Non-zero if NUL terminator. */
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, L(loop_misaligned)
+	b	L(end)
+
+L(done):
 	sub	result, data1, data2
 	RET
 END(strcmp)
```
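For readers who prefer C to assembly, below is a hedged model of what the new misaligned path in the diff above computes. It mirrors the logic only; it is not glibc's implementation, the function and macro names are mine (the constants correspond to the zeroones/REP8_7f values in strcmp.S), and it assumes a little-endian target compiled with GCC or Clang for `__builtin_ctzll`.

```c
#include <stdint.h>
#include <string.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7F 0x7f7f7f7f7f7f7f7fULL

/* Hedged C model of the misaligned path, not glibc code.  */
static int
strcmp_misaligned_model (const unsigned char *s1, const unsigned char *s2)
{
  /* L(do_misaligned): compare byte by byte until s1 is 8-byte aligned.  */
  while ((uintptr_t) s1 & 7)
    {
      unsigned c1 = *s1++, c2 = *s2++;
      if (c1 != c2 || c1 == 0)
        return (int) c1 - (int) c2;
    }

  for (;;)
    {
      /* L(loop_misaligned): if s2 is within the last dword of a 4K page,
         handle the next eight bytes one at a time so no load can touch
         the following page.  Aligned loads from s1 are always safe since
         an aligned dword never crosses a page boundary.  */
      if (((uintptr_t) s2 & 0xff8) == 0xff8)
        {
          do
            {
              unsigned c1 = *s1++, c2 = *s2++;
              if (c1 != c2 || c1 == 0)
                return (int) c1 - (int) c2;
            }
          while ((uintptr_t) s1 & 7);
          continue;
        }

      uint64_t d1, d2;
      memcpy (&d1, s1, 8);      /* ldr data1, [src1], #8 */
      memcpy (&d2, s2, 8);      /* ldr data2, [src2], #8 */
      s1 += 8;
      s2 += 8;

      /* Classic has-zero-byte trick: non-zero iff d1 contains a NUL.  */
      uint64_t has_nul = (d1 - REP8_01) & ~(d1 | REP8_7F);
      uint64_t diff = d1 ^ d2;  /* Non-zero if the dwords differ.  */
      uint64_t syndrome = diff | has_nul;
      if (syndrome == 0)
        continue;               /* cbz syndrome, L(loop_misaligned) */

      /* L(end), little-endian case: the lowest set bit of the syndrome
         sits in the first byte that differs or terminates the string.  */
      unsigned shift = __builtin_ctzll (syndrome) & ~7u;
      return (int) ((d1 >> shift) & 0xff) - (int) ((d2 >> shift) & 0xff);
    }
}
```

The real assembly keeps everything in registers and decodes the syndrome with rev/clz at L(end); the model above only illustrates why a single cbz on the syndrome can exit the loop on either a byte difference or a NUL terminator.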