Message ID | 20210120092914.256388-1-goldstein.w.n@gmail.com
---|---
State | New
Series | x86: Refactor and improve performance of strchr-avx2.S
On Wed, Jan 20, 2021 at 1:34 AM noah via Libc-alpha <libc-alpha@sourceware.org> wrote:
>
> No bug. Just seemed the performance could be improved a bit. Observed
> and expected behavior are unchanged. Optimized body of main
> loop. Updated page cross logic and optimized accordingly. Made a few
> minor instruction selection modifications. No regressions in test
> suite. Both test-strchrnul and test-strchr passed.

Thank you very much.

> Author: noah <goldstein.w.n@gmail.com>
> ---
> Possible improvements fall into 5 categories, roughly ordered by
> importance:
>
> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>     from it. This comes at the cost of additional port pressure on
>     ports 0/1, as vpminub and vpminud can only run on those ports,
>     whereas vpor can run on ports 0/1/5. But the 4 saved uops should
>     more than make up for that. Analysis by latency, by throughput, or
>     by benchmarks shows it is a performance improvement.
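For readers of the archive, the uop saving in point 1 comes from folding the
null-byte test into the CHAR test with an unsigned byte minimum. The identity
can be written out in C with AVX2 intrinsics; this is an illustrative sketch
(compile with -mavx2), not code taken from the patch:

#include <immintrin.h>

/* For unsigned bytes, min(x ^ c, x) == 0 exactly when x == c or x == 0,
   so a single compare against zero finds both CHAR and the terminator.  */
static int
char_or_nul_mask (__m256i data, __m256i broadcast_c)
{
  __m256i xored  = _mm256_xor_si256 (data, broadcast_c);  /* 0 where byte == CHAR */
  __m256i merged = _mm256_min_epu8 (xored, data);         /* 0 where CHAR or '\0' */
  __m256i zeros  = _mm256_cmpeq_epi8 (merged, _mm256_setzero_si256 ());
  return _mm256_movemask_epi8 (zeros);                    /* one bit per matching byte */
}

In the patched loop the four vpxor/VPMINU pairs are reduced with two more
VPMINUs, so one VPCMPEQ/vpmovmskb decides whether any of the 128 bytes hit;
the per-vector compares against zero only run on the cold exit path.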
>
> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>     the jump label was especially confusing). The original code
>     tested for this by checking whether the load would split cache
>     lines, not pages. I don't think there are any x86_64 architectures
>     that

The original code has

   /* Check if we may cross page boundary with one vector load.  */
        andl    $(2 * VEC_SIZE - 1), %ecx
        cmpl    $VEC_SIZE, %ecx
        ja      L(cros_page_boundary)

It is just very conservative.  It never fails to detect a page cross.

>     support both AVX2 (Haswell and newer) and don't have a minimum
>     page size of 4kb. Given that the cost of splitting a cache line
>     appears to be low on CPUs that support AVX2
>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>     is a one-off, I don't see it as being critical to avoid. If it is
>     critical to avoid a cache-line-split load, I think a branchless
>     approach might be better. Thoughts? Note: given that the check was
>     changed to only fire on page crosses, I think it is significantly
>     less likely to be taken than before, so I moved the branch target
>     away from such prime real estate. I am also unsure whether there is
>     a PAGE_SIZE define somewhere in glibc that I could use instead of
>     defining it here.

Defining PAGE_SIZE as 4096 is fine.
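For comparison, the old and new branch conditions amount to the following C
predicates (an illustrative sketch using the patch's VEC_SIZE/PAGE_SIZE values,
not code from the patch):

#include <stdbool.h>
#include <stdint.h>

#define VEC_SIZE  32
#define PAGE_SIZE 4096

/* Old check: taken whenever the 32-byte load would straddle an aligned
   64-byte block, i.e. on many harmless cache-line splits.  Every real page
   cross also satisfies it, which is why the old code was correct, just
   conservative.  */
static bool
old_cross_check (uintptr_t addr)
{
  return (addr & (2 * VEC_SIZE - 1)) > VEC_SIZE;
}

/* New check: taken only when the 32-byte load would actually run past the
   end of the current 4 KiB page.  */
static bool
new_cross_check (uintptr_t addr)
{
  return (addr & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}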
> 3 - What was originally the more_4x_vec label was removed and the code
>     only does 3 vector blocks now. The reasoning is as follows: there
>     are two entry points to that code section, from a page cross or
>     from the fall-through (no page cross). The fall-through case is the
>     more important one to optimize for, assuming the point above. In
>     this case the incoming alignment (i.e. the alignment of the pointer
>     in rdi) mod 128 can be [0 - 31], [32 - 63], [64 - 95], or
>     [96 - 127]. Doing 4 vector blocks optimizes for [96 - 127], so that
>     when the main 4x loop is hit, a new 128-byte-aligned segment can be
>     started. Doing 3 vector blocks optimizes for the [0 - 31] case. I
>     generally think a string is more likely to be aligned to the cache
>     line size (or to L2-prefetcher cache line pairs) than to
>     [96 - 127] bytes. An alternative would be to make that code do 5
>     vector blocks. This would mean that at most 2 vector blocks are
>     wasted when realigning to 128 bytes (as opposed to 3x or 4x, which
>     can allow 3 vector blocks to be wasted). Thoughts?

I picked 4x because it was how I unrolled in other string functions.
3x is fine if it is faster than 4x.

> 4 - Replace salq using the %cl partial register with sarx. This assumes
>     BMI2, which is Haswell and newer, but AVX2 implies Haswell and
>     newer, so I think it is safe.

Need to add a BMI2 check in strchr.c:

   && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
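To make the sarx point concrete, here is a rough C rendering of what the
cross-page cold path in the diff below does: align the pointer down, do one
in-page vector load, and shift the bytes before the start out of the match
mask. The mask shift is where the patch uses the three-operand BMI2 shift
instead of a %cl-count shift. This is an editor's sketch (it mimics the same
below-the-start read the assembly performs and needs -mavx2), not glibc code:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_SIZE 32

/* Returns a pointer to the first CHAR or '\0' within the first vector of s,
   or NULL if that vector contains neither (strchrnul-style).  The load is
   aligned down to VEC_SIZE, so it cannot cross the page s is on.  */
static const char *
first_vec_cross_page (const char *s, int c)
{
  unsigned int shift = (uintptr_t) s & (VEC_SIZE - 1);
  const __m256i *aligned = (const __m256i *) ((uintptr_t) s - shift);

  __m256i data = _mm256_load_si256 (aligned);
  __m256i vc   = _mm256_set1_epi8 ((char) c);
  __m256i hit  = _mm256_or_si256 (_mm256_cmpeq_epi8 (data, vc),
                                  _mm256_cmpeq_epi8 (data, _mm256_setzero_si256 ()));
  unsigned int mask = (unsigned int) _mm256_movemask_epi8 (hit);

  /* Discard match bits for the bytes before s.  The assembly does this with
     sarx, taking the count straight from %ecx; the old code needed the count
     in %cl for sarl.  */
  mask >>= shift;
  if (mask == 0)
    return NULL;
  return s + __builtin_ctz (mask);
}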
>
> 5 - In the first_vec_xN return blocks, change the addq from using rax
>     as a destination to rdi as a destination. This just shortens the
>     dependent latency chain by 1 uop.

Sounds reasonable.

> Benchmarks:
> I can post my benchmarking code in the email thread if that is
> appreciated. I benchmarked a variety of cases with different
> alignments, sizes, and data hotness (in L1, L2, etc...), so I cannot
> just give a single number for the percentage improvement. They were
> also run on my personal computer (icelake-client). Of my 2732 test
> cases, 1985 saw an improvement with these changes and 747 performed
> better with the original code. I should note, however, that my test
> cases had a disproportionate number of cases with 4kb page crosses,
> which as discussed I moved to a colder path.

Please submit a separate patch to add your workload to
benchtests/bench-strstr.c.

> In general the effects of this change are:
>
> Large/medium sized strings (from any part of memory, really): 10-30%
> performance boost.
> Small strings that are not page crosses: 10-20% performance boost.
> Small strings that are cache line splits: 20-30% performance boost.
> Small strings that cross a page boundary (a 4kb page, that is): about
> a 10% performance regression.
>
> No regressions in test suite. Both test-strchrnul and test-strchr
> passed.

It is also used for wcschr and wcschrnul.  Do they work correctly with
your change?

> I would love to hear your feedback on all of these points (or any that
> I missed).
>
> FSF documentation has been signed and returned (via pub key and email
> respectively).
>
>  sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------
>  1 file changed, 87 insertions(+), 86 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
> index d416558d04..09c2df86d1 100644
> --- a/sysdeps/x86_64/multiarch/strchr-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
> @@ -27,10 +27,12 @@
>  # ifdef USE_AS_WCSCHR
>  #  define VPBROADCAST vpbroadcastd
>  #  define VPCMPEQ vpcmpeqd
> +#  define VPMINU vpminud
>  #  define CHAR_REG esi
>  # else
>  #  define VPBROADCAST vpbroadcastb
>  #  define VPCMPEQ vpcmpeqb
> +#  define VPMINU vpminub
>  #  define CHAR_REG sil
>  # endif
>
> @@ -39,7 +41,8 @@
>  # endif
>
>  # define VEC_SIZE 32
> -
> +# define PAGE_SIZE 4096
> +
>          .section .text.avx,"ax",@progbits
>  ENTRY (STRCHR)
>          movl    %edi, %ecx
> @@ -48,8 +51,8 @@ ENTRY (STRCHR)
>          vpxor   %xmm9, %xmm9, %xmm9
>          VPBROADCAST     %xmm0, %ymm0
>          /* Check if we may cross page boundary with one vector load.  */
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please update comments.

> -        andl    $(2 * VEC_SIZE - 1), %ecx
> -        cmpl    $VEC_SIZE, %ecx
> +        andl    $(PAGE_SIZE - 1), %ecx
> +        cmpl    $(PAGE_SIZE - VEC_SIZE), %ecx
>          ja      L(cros_page_boundary)

Can you also replace cros_page_boundary with cross_page_boundary?

>
>          /* Check the first VEC_SIZE bytes.  Search for both CHAR and the
> @@ -63,45 +66,11 @@ ENTRY (STRCHR)
>          jnz     L(first_vec_x0)
>
>          /* Align data for aligned loads in the loop.  */
> -        addq    $VEC_SIZE, %rdi
> -        andl    $(VEC_SIZE - 1), %ecx
> -        andq    $-VEC_SIZE, %rdi
> -
> -        jmp     L(more_4x_vec)
> -
> -        .p2align 4
> -L(cros_page_boundary):
> -        andl    $(VEC_SIZE - 1), %ecx
> -        andq    $-VEC_SIZE, %rdi
> -        vmovdqu (%rdi), %ymm8
> -        VPCMPEQ %ymm8, %ymm0, %ymm1
> -        VPCMPEQ %ymm8, %ymm9, %ymm2
> -        vpor    %ymm1, %ymm2, %ymm1
> -        vpmovmskb %ymm1, %eax
> -        /* Remove the leading bytes.  */
> -        sarl    %cl, %eax
> -        testl   %eax, %eax
> -        jz      L(aligned_more)
> -        /* Found CHAR or the null byte.  */
> -        tzcntl  %eax, %eax
> -        addq    %rcx, %rax
> -# ifdef USE_AS_STRCHRNUL
> -        addq    %rdi, %rax
> -# else
> -        xorl    %edx, %edx
> -        leaq    (%rdi, %rax), %rax
> -        cmp     (%rax), %CHAR_REG
> -        cmovne  %rdx, %rax
> -# endif
> -        VZEROUPPER
> -        ret
> -
> -        .p2align 4
> +        andq    $-VEC_SIZE, %rdi
>  L(aligned_more):
>          addq    $VEC_SIZE, %rdi
>
> -L(more_4x_vec):
> -        /* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
> +        /* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
>             since data is only aligned to VEC_SIZE.  */
>          vmovdqa (%rdi), %ymm8
>          VPCMPEQ %ymm8, %ymm0, %ymm1
> @@ -127,19 +96,9 @@ L(more_4x_vec):
>          testl   %eax, %eax
>          jnz     L(first_vec_x2)
>
> -        vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
> -        VPCMPEQ %ymm8, %ymm0, %ymm1
> -        VPCMPEQ %ymm8, %ymm9, %ymm2
> -        vpor    %ymm1, %ymm2, %ymm1
> -        vpmovmskb %ymm1, %eax
> -        testl   %eax, %eax
> -        jnz     L(first_vec_x3)
> -
> -        addq    $(VEC_SIZE * 4), %rdi
> -
> +
>          /* Align data to 4 * VEC_SIZE.  */
> -        movq    %rdi, %rcx
> -        andl    $(4 * VEC_SIZE - 1), %ecx
> +        addq    $(VEC_SIZE * 3), %rdi
>          andq    $-(4 * VEC_SIZE), %rdi
>
>          .p2align 4
> @@ -150,34 +109,61 @@ L(loop_4x_vec):
>          vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7
>          vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
>
> -        VPCMPEQ %ymm5, %ymm0, %ymm1
> -        VPCMPEQ %ymm6, %ymm0, %ymm2
> -        VPCMPEQ %ymm7, %ymm0, %ymm3
> -        VPCMPEQ %ymm8, %ymm0, %ymm4
> -
> -        VPCMPEQ %ymm5, %ymm9, %ymm5
> -        VPCMPEQ %ymm6, %ymm9, %ymm6
> -        VPCMPEQ %ymm7, %ymm9, %ymm7
> -        VPCMPEQ %ymm8, %ymm9, %ymm8
> +        /* Leaves only CHARS matching esi as 0.  */
> +        vpxor   %ymm5, %ymm0, %ymm1
> +        vpxor   %ymm6, %ymm0, %ymm2
> +        vpxor   %ymm7, %ymm0, %ymm3
> +        vpxor   %ymm8, %ymm0, %ymm4
>
> -        vpor    %ymm1, %ymm5, %ymm1
> -        vpor    %ymm2, %ymm6, %ymm2
> -        vpor    %ymm3, %ymm7, %ymm3
> -        vpor    %ymm4, %ymm8, %ymm4
> +        VPMINU  %ymm1, %ymm5, %ymm1
> +        VPMINU  %ymm2, %ymm6, %ymm2
> +        VPMINU  %ymm3, %ymm7, %ymm3
> +        VPMINU  %ymm4, %ymm8, %ymm4
>
> -        vpor    %ymm1, %ymm2, %ymm5
> -        vpor    %ymm3, %ymm4, %ymm6
> +        VPMINU  %ymm1, %ymm2, %ymm5
> +        VPMINU  %ymm3, %ymm4, %ymm6
>
> -        vpor    %ymm5, %ymm6, %ymm5
> +        VPMINU  %ymm5, %ymm6, %ymm5
>
> +        VPCMPEQ %ymm5, %ymm9, %ymm5
>          vpmovmskb %ymm5, %eax
> +        addq    $(VEC_SIZE * 4), %rdi
> +
>          testl   %eax, %eax
> -        jnz     L(4x_vec_end)
> +        jz      L(loop_4x_vec)
>
> -        addq    $(VEC_SIZE * 4), %rdi
> +        subq    $(VEC_SIZE * 4), %rdi
> +
> +L(4x_vec_end):
> +        VPCMPEQ %ymm1, %ymm9, %ymm1
> +        vpmovmskb %ymm1, %eax
> +        testl   %eax, %eax
> +        jnz     L(first_vec_x0)
> +        VPCMPEQ %ymm2, %ymm9, %ymm2
> +        vpmovmskb %ymm2, %eax
> +        testl   %eax, %eax
> +        jnz     L(first_vec_x1)
> +        VPCMPEQ %ymm3, %ymm9, %ymm3
> +        vpmovmskb %ymm3, %eax
> +        testl   %eax, %eax
> +        jnz     L(first_vec_x2)
> +        VPCMPEQ %ymm4, %ymm9, %ymm4
> +        vpmovmskb %ymm4, %eax
>
> -        jmp     L(loop_4x_vec)
> +        tzcntl  %eax, %eax
> +# ifdef USE_AS_STRCHRNUL
> +        addq    $(VEC_SIZE * 3), %rdi
> +        addq    %rdi, %rax
> +# else
> +        xorl    %edx, %edx
> +        leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +        cmp     (%rax), %CHAR_REG
> +        cmovne  %rdx, %rax
> +# endif
> +        VZEROUPPER
> +        ret
>
> +
>          .p2align 4
>  L(first_vec_x0):
>          /* Found CHAR or the null byte.  */
> @@ -197,7 +183,7 @@ L(first_vec_x0):
>  L(first_vec_x1):
>          tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -        addq    $VEC_SIZE, %rax
> +        addq    $VEC_SIZE, %rdi
>          addq    %rdi, %rax
>  # else
>          xorl    %edx, %edx
> @@ -212,7 +198,7 @@ L(first_vec_x1):
>  L(first_vec_x2):
>          tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -        addq    $(VEC_SIZE * 2), %rax
> +        addq    $(VEC_SIZE * 2), %rdi
>          addq    %rdi, %rax
>  # else
>          xorl    %edx, %edx
> @@ -223,32 +209,47 @@ L(first_vec_x2):
>          VZEROUPPER
>          ret
>
> +        /* Cold case for crossing page with first load.  */
>          .p2align 4
> -L(4x_vec_end):
> +L(cros_page_boundary):
> +        andl    $(VEC_SIZE - 1), %ecx
> +        andq    $-VEC_SIZE, %rdi
> +        vmovdqu (%rdi), %ymm8
> +        VPCMPEQ %ymm8, %ymm0, %ymm1
> +        VPCMPEQ %ymm8, %ymm9, %ymm2
> +        vpor    %ymm1, %ymm2, %ymm1
>          vpmovmskb %ymm1, %eax
> +        /* Remove the leading bits.  */
> +        sarxl   %ecx, %eax, %eax
>          testl   %eax, %eax
> -        jnz     L(first_vec_x0)
> -        vpmovmskb %ymm2, %eax
> -        testl   %eax, %eax
> -        jnz     L(first_vec_x1)
> -        vpmovmskb %ymm3, %eax
> -        testl   %eax, %eax
> -        jnz     L(first_vec_x2)
> -        vpmovmskb %ymm4, %eax
> +        jnz     L(cros_page_return)
> +
> +        /* Second block so that the 3 other blocks from L(aligned_more)
> +           will get to next 4 * VEC_SIZE alignment.  */
> +        andq    $-VEC_SIZE, %rdi
> +        addq    $VEC_SIZE, %rdi
> +        xorl    %ecx, %ecx
> +        vmovdqa (%rdi), %ymm8
> +        VPCMPEQ %ymm8, %ymm0, %ymm1
> +        VPCMPEQ %ymm8, %ymm9, %ymm2
> +        vpor    %ymm1, %ymm2, %ymm1
> +        vpmovmskb %ymm1, %eax
>          testl   %eax, %eax
> -L(first_vec_x3):
> +        jz      L(aligned_more)
> +
> +L(cros_page_return):
>          tzcntl  %eax, %eax
> +        addq    %rcx, %rax
>  # ifdef USE_AS_STRCHRNUL
> -        addq    $(VEC_SIZE * 3), %rax
>          addq    %rdi, %rax
>  # else
>          xorl    %edx, %edx
> -        leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +        leaq    (%rdi, %rax), %rax
>          cmp     (%rax), %CHAR_REG
>          cmovne  %rdx, %rax
>  # endif
>          VZEROUPPER
>          ret
> -
> +
>  END (STRCHR)
> -#endif
> +# endif
> --
> 2.29.2
>
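A note on the return blocks that recur in the diff above: the sequence
xorl %edx, %edx / cmp (%rax), %CHAR_REG / cmovne %rdx, %rax is the branchless
strchr-versus-strchrnul distinction. In C terms it is roughly the following
(an illustrative sketch, not glibc code):

#include <stddef.h>

/* `stop` points at the first byte equal to c or at the terminating '\0'.
   strchrnul returns it unconditionally; strchr must return NULL when the
   scan stopped on the terminator rather than on c.  The cmovne performs
   this selection without a branch.  */
static char *
select_return (char *stop, int c)
{
  return (*stop == (char) c) ? stop : NULL;
}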