Message ID: 20210913230506.546749-5-goldstein.w.n@gmail.com
State: New
Series: [1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex
On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
> No bug. This commit adds a new optimized bcmp implementation for evex.
>
> The primary optimizations are 1) skipping the logic to find the
> difference of the first mismatched byte and 2) not updating src/dst
> addresses, as the non-equals logic does not need to be reused by
> different areas.
>
> The entry alignment has been fixed at 64. In throughput-sensitive
> functions, which bcmp can potentially be, frontend loop performance is
> important to optimize for. This is impossible/difficult to do/maintain
> with only 16-byte fixed alignment.
>
> test-memcmp, test-bcmp, and test-wmemcmp are all passing.

This series fails in the containerized 32-bit x86 CI/CD regression tester.

https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/
On Mon, Sep 13, 2021 at 8:18 PM Carlos O'Donell <carlos@redhat.com> wrote:
> On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
> > No bug. This commit adds a new optimized bcmp implementation for evex.
> >
> > The primary optimizations are 1) skipping the logic to find the
> > difference of the first mismatched byte and 2) not updating src/dst
> > addresses, as the non-equals logic does not need to be reused by
> > different areas.
> >
> > The entry alignment has been fixed at 64. In throughput-sensitive
> > functions, which bcmp can potentially be, frontend loop performance is
> > important to optimize for. This is impossible/difficult to do/maintain
> > with only 16-byte fixed alignment.
> >
> > test-memcmp, test-bcmp, and test-wmemcmp are all passing.
>
> This series fails in the containerized 32-bit x86 CI/CD regression tester.
>
> https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/

Shoot. AFAICT the first error is:

*** No rule to make target '/build/string/stamp.os', needed by '/build/libc_pic.a'.

I saw that issue earlier when I was working on just supporting bcmp for the
first commit:

[PATCH 1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex

So I think I missed/messed up something there regarding the necessary changes
to the Makefile/build infrastructure to support the change. While it doesn't
appear to be an issue on my local machine, I left the redirect in
string/memcmp.c:

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memcmp.c;h=9b46d7a905c8b7886f046b7660f63df10dc4573c;hb=HEAD#l360

But that was one area where I didn't really know the right answer. Does anyone
know if there is anything special that needs to be done for the 32-bit build
when adding a new implementation?

Also, does anyone know what make/configure commands I need to reproduce this
on an x86_64-Linux machine? The build log doesn't appear to have the command.
For my completely fresh build / testing I ran:

rm -rf /path/to/build/glibc; mkdir -p /path/to/build/glibc;
(cd /path/to/build/glibc/; unset LD_LIBRARY_PATH;
 /path/to/src/glibc/configure --prefix=/usr; make --silent; make xcheck;
 make -r -C /path/to/src/glibc/string/ objdir=`pwd` check;
 make -r -C /path/to/src/glibc/wcsmbs/ objdir=`pwd` check)

which doesn't appear to have cut it.

> --
> Cheers,
> Carlos.
On 9/13/21 10:05 PM, Noah Goldstein wrote:
> On Mon, Sep 13, 2021 at 8:18 PM Carlos O'Donell <carlos@redhat.com> wrote:
>
>> On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
>>> No bug. This commit adds a new optimized bcmp implementation for evex.
>>>
>>> The primary optimizations are 1) skipping the logic to find the
>>> difference of the first mismatched byte and 2) not updating src/dst
>>> addresses, as the non-equals logic does not need to be reused by
>>> different areas.
>>>
>>> The entry alignment has been fixed at 64. In throughput-sensitive
>>> functions, which bcmp can potentially be, frontend loop performance is
>>> important to optimize for. This is impossible/difficult to do/maintain
>>> with only 16-byte fixed alignment.
>>>
>>> test-memcmp, test-bcmp, and test-wmemcmp are all passing.
>>
>> This series fails in the containerized 32-bit x86 CI/CD regression tester.
>>
>> https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/
>
> Shoot.

No worries! That's what the CI/CD system is there for :-)

> AFAICT the first error is:
> *** No rule to make target '/build/string/stamp.os', needed by
> '/build/libc_pic.a'.

I think a normal 32-bit x86 build should show this issue.

You need a gcc that accepts -m32.

I minimally set:

export CC="gcc -m32 -Wl,--build-id=none"
export CXX="g++ -m32 -Wl,--build-id=none"
export CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
export CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
export CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"

Then build with --host, e.g.:

/home/carlos/src/glibc-work/configure --host i686-pc-linux-gnu \
  CC="gcc -m32 -Wl,--build-id=none" \
  CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" \
  CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" \
  CXX="g++ -m32 -Wl,--build-id=none" \
  CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" \
  --prefix=/usr \
  --with-headers=/home/carlos/build/glibc-headers-work-i686/include \
  --with-selinux --disable-nss-crypt --enable-bind-now --enable-static-pie \
  --enable-systemtap --enable-hardcoded-path-in-tests --enable-tunables=yes \
  --enable-add-ons

> Also, does anyone know what make/configure commands I need to reproduce
> this on an x86_64-Linux machine? The build log doesn't appear to have the command.

DJ, should the trybot log the configure step?
"Carlos O'Donell" <carlos@redhat.com> writes: >> Also, does anyone know what make/configure commands I need to reproduce >> this on a x86_64-Linux machine? The build log doesn't appear to have the command. > > DJ, Should the trybot log the configure step? Perhaps. It's in the stdout that gets added to the trybot's general log file, rather than a per-series log (and in the git repo's sample script ;). It's: /glibc/configure CC="gcc -m32" CXX="g++ -m32" --prefix=/usr \ --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu However, this doesn't smell like a 64-vs-32 bug, but a x86-64 vs anything-else bug. (It's also in build-many-glibcs.py)
On Mon, Sep 13, 2021 at 9:55 PM DJ Delorie <dj@redhat.com> wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> >> Also, does anyone know what make/configure commands I need to reproduce
> >> this on an x86_64-Linux machine? The build log doesn't appear to have
> the command.
> >
> > DJ, should the trybot log the configure step?
>
> Perhaps. It's in the stdout that gets added to the trybot's general log
> file, rather than a per-series log (and in the git repo's sample script ;).
> It's:
>
> /glibc/configure CC="gcc -m32" CXX="g++ -m32" --prefix=/usr \
>     --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu

Thanks, I was able to reproduce the bug with that.

> However, this doesn't smell like a 64-vs-32 bug, but an x86-64 vs
> anything-else bug.

That makes sense.

> (It's also in build-many-glibcs.py)

Thanks!
On Mon, Sep 13, 2021 at 9:35 PM Carlos O'Donell <carlos@redhat.com> wrote:
> On 9/13/21 10:05 PM, Noah Goldstein wrote:
> > On Mon, Sep 13, 2021 at 8:18 PM Carlos O'Donell <carlos@redhat.com> wrote:
> >
> >> On 9/13/21 7:05 PM, Noah Goldstein via Libc-alpha wrote:
> >>> No bug. This commit adds a new optimized bcmp implementation for evex.
> >>>
> >>> The primary optimizations are 1) skipping the logic to find the
> >>> difference of the first mismatched byte and 2) not updating src/dst
> >>> addresses, as the non-equals logic does not need to be reused by
> >>> different areas.
> >>>
> >>> The entry alignment has been fixed at 64. In throughput-sensitive
> >>> functions, which bcmp can potentially be, frontend loop performance is
> >>> important to optimize for. This is impossible/difficult to do/maintain
> >>> with only 16-byte fixed alignment.
> >>>
> >>> test-memcmp, test-bcmp, and test-wmemcmp are all passing.
> >>
> >> This series fails in the containerized 32-bit x86 CI/CD regression tester.
> >>
> >> https://patchwork.sourceware.org/project/glibc/patch/20210913230506.546749-5-goldstein.w.n@gmail.com/
> >
> > Shoot.
>
> No worries! That's what the CI/CD system is there for :-)
>
> > AFAICT the first error is:
> > *** No rule to make target '/build/string/stamp.os', needed by
> > '/build/libc_pic.a'.
>
> I think a normal 32-bit x86 build should show this issue.
>
> You need a gcc that accepts -m32.

Was able to get it with DJ's command.

> I minimally set:
>
> export CC="gcc -m32 -Wl,--build-id=none"
> export CXX="g++ -m32 -Wl,--build-id=none"
> export CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
> export CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
> export CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none"
>
> Then build with --host, e.g.:
>
> /home/carlos/src/glibc-work/configure --host i686-pc-linux-gnu \
>   CC="gcc -m32 -Wl,--build-id=none" \
>   CFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" \
>   CPPFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" \
>   CXX="g++ -m32 -Wl,--build-id=none" \
>   CXXFLAGS="-g -O2 -march=i686 -Wl,--build-id=none" \
>   --prefix=/usr \
>   --with-headers=/home/carlos/build/glibc-headers-work-i686/include \
>   --with-selinux --disable-nss-crypt --enable-bind-now --enable-static-pie \
>   --enable-systemtap --enable-hardcoded-path-in-tests --enable-tunables=yes \
>   --enable-add-ons

Thanks for the help!

> > Also, does anyone know what make/configure commands I need to reproduce
> > this on an x86_64-Linux machine? The build log doesn't appear to have the
> command.
>
> DJ, should the trybot log the configure step?

So I think I was able to fix the build by making a new file in
glibc/string/bcmp.c and just having bcmp call memcmp.

Is there another/better way to fix the build? I don't think it's really fair
that every arch other than x86_64 should have to pay an extra function call
cost to use bcmp.

> --
> Cheers,
> Carlos.
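[Editorial note: for concreteness, the stopgap described above would look roughly like the following generic C file. This is a hedged sketch of the workaround as described, not the code actually posted, and it is the variant that costs every arch without an assembly bcmp a full extra call.]

```c
/* Hypothetical sketch of the workaround described above: a generic
   string/bcmp.c that simply forwards to memcmp.  Every call to bcmp
   then pays for an extra function call, which is the cost being
   objected to in the thread.  */
#include <string.h>
#include <strings.h>

int
bcmp (const void *s1, const void *s2, size_t n)
{
  return memcmp (s1, s2, n);
}
```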
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> So I think I was able to fix the build by making a new file in
> glibc/string/bcmp.c and just having bcmp call memcmp.
>
> Is there another/better way to fix the build? I don't think it's really fair
> that every arch other than x86_64 should have to pay an extra function call
> cost to use bcmp.

There are at least three...

First, note that bcmp is a weak alias to memcmp already - see
string/memcmp.c - which avoids the extra call you mention.

So, you could either move that weak alias into bcmp.c, or arrange for bcmp.c
to not be needed by the Makefile for non-x86_64 platforms. Lastly, an empty
bcmp.c wouldn't override the alias in memcmp.c. I think the first would be
easiest, although it may be tricky to compile a source file that seems to do
"nothing". Also, I suspect liberal use of comments would be beneficial for
the unsuspecting reader ;-)

Alternately, you could change your patch to provide alternate versions of
memcmp() instead of bcmp(), as glibc's bcmp *is* memcmp. This is what other
arches (and x86_64) do:

$ find . -name 'memcmp*' -print
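[Editorial note: the alias being referred to sits at the tail of string/memcmp.c, the line-360 link posted earlier in the thread. It looks roughly like this; see the linked source for the exact text.]

```c
/* Approximate tail of glibc's generic string/memcmp.c.  bcmp is
   exported as a weak alias of memcmp, so a call to bcmp resolves to
   the memcmp definition itself, with no extra call overhead.  */
#ifdef weak_alias
# undef bcmp
weak_alias (memcmp, bcmp)
#endif
```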
On Mon, Sep 13, 2021 at 11:21 PM DJ Delorie <dj@redhat.com> wrote:
> Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > So I think I was able to fix the build by making a new file in
> > glibc/string/bcmp.c and just having bcmp call memcmp.
> >
> > Is there another/better way to fix the build? I don't think it's really
> > fair that every arch other than x86_64 should have to pay an extra
> > function call cost to use bcmp.
>
> There are at least three...
>
> First, note that bcmp is a weak alias to memcmp already - see
> string/memcmp.c - which avoids the extra call you mention.
>
> So, you could either move that weak alias into bcmp.c, or arrange for
> bcmp.c to not be needed by the Makefile for non-x86_64 platforms.
> Lastly, an empty bcmp.c wouldn't override the alias in memcmp.c. I
> think the first would be easiest, although it may be tricky to compile a
> source file that seems to do "nothing". Also, I suspect liberal use of
> comments would be beneficial for the unsuspecting reader ;-)

I see. I was able to get it working with just an empty bcmp.c file, but was
not able to move the weak_alias from memcmp.c to bcmp.c.

Adding:

```
#ifdef weak_alias
# undef bcmp
weak_alias (memcmp, bcmp)
#endif
```

to bcmp.c gets me the following compiler error:

```
bcmp.c:24:21: error: ‘bcmp’ aliased to undefined symbol ‘memcmp’
```

irrespective of the ifdef/undef and whether I include string.h / manually put
in a prototype of memcmp.

Sorry for the hassle. Build infrastructure, especially in a project as
complex as this, is a bit out of my domain.

> Alternately, you could change your patch to provide alternate versions
> of memcmp() instead of bcmp(), as glibc's bcmp *is* memcmp. This is
> what other arches (and x86_64) do:

I'm not 100% sure what you mean? memcmp can correctly implement bcmp but not
vice versa.

> $ find . -name 'memcmp*' -print
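[Editorial note: the error above is consistent with how GCC's alias attribute, which glibc's weak_alias macro expands to, behaves: the alias target must be defined in the same translation unit, not merely declared. A minimal standalone illustration, independent of the glibc build (the my_* names are invented for the example):]

```c
/* Illustration only (not glibc code) of why a standalone bcmp.c that
   just contains weak_alias (memcmp, bcmp) fails to compile.  */
#include <string.h>

/* This function is *defined* in this translation unit, so creating an
   alias to it works.  */
int
my_memcmp (const void *s1, const void *s2, size_t n)
{
  return memcmp (s1, s2, n);
}

/* Roughly what glibc's weak_alias (my_memcmp, my_bcmp) expands to.  */
extern __typeof (my_memcmp) my_bcmp
  __attribute__ ((weak, alias ("my_memcmp")));

/* By contrast, the library's memcmp is only *declared* here, so the
   following fails with "'my_bcmp2' aliased to undefined symbol
   'memcmp'" -- the same error shown above.  */
#if 0
extern __typeof (memcmp) my_bcmp2
  __attribute__ ((weak, alias ("memcmp")));
#endif
```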
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> I'm not 100% sure what you mean? memcmp can correctly implement bcmp
> but not vice versa.

glibc does not have a separate implementation of bcmp(). Any calls to bcmp()
end up calling memcmp() (through that weak alias). So your patch is not
*optimizing* bcmp, it is *adding* bcmp. The new version you are adding is no
longer using the optimized versions of memcmp, so you'd have to either
(1) be very careful to not introduce a performance regression, or
(2) optimize the existing memcmp()s further instead.
On Tue, Sep 14, 2021 at 12:42 AM DJ Delorie <dj@redhat.com> wrote:
> Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > I'm not 100% sure what you mean? memcmp can correctly implement bcmp
> > but not vice versa.
>
> glibc does not have a separate implementation of bcmp(). Any calls to
> bcmp() end up calling memcmp() (through that weak alias). So your patch
> is not *optimizing* bcmp, it is *adding* bcmp. The new version you are
> adding is no longer using the optimized versions of memcmp, so you'd
> have to either (1) be very careful to not introduce a performance
> regression, or (2) optimize the existing memcmp()s further instead.

Ah, got it.

In the first patch of the set:

[PATCH 1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex

I have some performance numbers. They seem to show an improvement for
avx2/evex. The sse2/sse4 results are a bit more iffy; I don't really have the
hardware to properly test those versions.

Thank you for all the help!
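[Editorial note: as background for the diff below, the reason a dedicated bcmp can skip work is that bcmp's contract is weaker than memcmp's: it only has to report equal/not-equal, while memcmp's sign must order the first mismatching byte. That is also why memcmp can implement bcmp but not the reverse. A small self-contained illustration (not glibc code):]

```c
/* bcmp vs. memcmp contracts: bcmp may return any nonzero value on a
   mismatch, while memcmp's sign must reflect the ordering of the first
   differing byte.  */
#include <assert.h>
#include <string.h>
#include <strings.h>

int
main (void)
{
  const char a[] = "abc", b[] = "abd";

  assert (bcmp (a, b, 3) != 0);   /* any nonzero value satisfies bcmp */
  assert (memcmp (a, b, 3) < 0);  /* memcmp must report 'c' < 'd'     */
  assert (memcmp (a, a, 3) == 0 && bcmp (a, a, 3) == 0);
  return 0;
}
```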
diff --git a/sysdeps/x86_64/multiarch/bcmp-evex.S b/sysdeps/x86_64/multiarch/bcmp-evex.S index ade52e8c68..1bfe824eb4 100644 --- a/sysdeps/x86_64/multiarch/bcmp-evex.S +++ b/sysdeps/x86_64/multiarch/bcmp-evex.S @@ -16,8 +16,305 @@ License along with the GNU C Library; if not, see <https://www.gnu.org/licenses/>. */ -#ifndef MEMCMP -# define MEMCMP __bcmp_evex -#endif +#if IS_IN (libc) + +/* bcmp is implemented as: + 1. Use ymm vector compares when possible. The only case where + vector compares is not possible for when size < VEC_SIZE + and loading from either s1 or s2 would cause a page cross. + 2. Use xmm vector compare when size >= 8 bytes. + 3. Optimistically compare up to first 4 * VEC_SIZE one at a + to check for early mismatches. Only do this if its guranteed the + work is not wasted. + 4. If size is 8 * VEC_SIZE or less, unroll the loop. + 5. Compare 4 * VEC_SIZE at a time with the aligned first memory + area. + 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. + 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. + 8. Use 8 vector compares when size is 8 * VEC_SIZE or less. */ + +# include <sysdep.h> + +# ifndef BCMP +# define BCMP __bcmp_evex +# endif + +# define VMOVU vmovdqu64 +# define VPCMP vpcmpub +# define VPTEST vptestmb + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + +# define YMM0 ymm16 +# define YMM1 ymm17 +# define YMM2 ymm18 +# define YMM3 ymm19 +# define YMM4 ymm20 +# define YMM5 ymm21 +# define YMM6 ymm22 + + + .section .text.evex, "ax", @progbits +ENTRY_P2ALIGN (BCMP, 6) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %edx +# endif + cmp $VEC_SIZE, %RDX_LP + jb L(less_vec) + + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */ + VMOVU (%rsi), %YMM1 + /* Use compare not equals to directly check for mismatch. */ + VPCMP $4, (%rdi), %YMM1, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + cmpq $(VEC_SIZE * 2), %rdx + jbe L(last_1x_vec) + + /* Check second VEC no matter what. */ + VMOVU VEC_SIZE(%rsi), %YMM2 + VPCMP $4, VEC_SIZE(%rdi), %YMM2, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + /* Less than 4 * VEC. */ + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_2x_vec) + + /* Check third and fourth VEC no matter what. */ + VMOVU (VEC_SIZE * 2)(%rsi), %YMM3 + VPCMP $4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + VMOVU (VEC_SIZE * 3)(%rsi), %YMM4 + VPCMP $4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + /* Go to 4x VEC loop. */ + cmpq $(VEC_SIZE * 8), %rdx + ja L(more_8x_vec) + + /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any + branches. */ + + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %YMM1 + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %YMM2 + addq %rdx, %rdi + + /* Wait to load from s1 until addressed adjust due to unlamination. + */ + + /* vpxor will be all 0s if s1 and s2 are equal. Otherwise it will + have some 1s. */ + vpxorq -(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1 + vpxorq -(VEC_SIZE * 3)(%rdi), %YMM2, %YMM2 + + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %YMM3 + vpxorq -(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3 + /* Or together YMM1, YMM2, and YMM3 into YMM3. */ + vpternlogd $0xfe, %YMM1, %YMM2, %YMM3 -#include "memcmp-evex-movbe.S" + VMOVU -(VEC_SIZE)(%rsi, %rdx), %YMM4 + /* Ternary logic to xor (VEC_SIZE * 3)(%rdi) with YMM4 while oring + with YMM3. Result is stored in YMM4. */ + vpternlogd $0xde, -(VEC_SIZE)(%rdi), %YMM3, %YMM4 + /* Compare YMM4 with 0. If any 1s s1 and s2 don't match. 
*/ + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax +L(return_neq0): + ret + + /* Fits in padding needed to .p2align 5 L(less_vec). */ +L(last_1x_vec): + VMOVU -(VEC_SIZE * 1)(%rsi, %rdx), %YMM1 + VPCMP $4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1 + kmovd %k1, %eax + ret + + /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte + aligned. */ + .p2align 5 +L(less_vec): + /* Check if one or less char. This is necessary for size = 0 but is + also faster for size = 1. */ + cmpl $1, %edx + jbe L(one_or_less) + + /* Check if loading one VEC from either s1 or s2 could cause a page + cross. This can have false positives but is by far the fastest + method. */ + movl %edi, %eax + orl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + jg L(page_cross_less_vec) + + /* No page cross possible. */ + VMOVU (%rsi), %YMM2 + VPCMP $4, (%rdi), %YMM2, %k1 + kmovd %k1, %eax + /* Result will be zero if s1 and s2 match. Otherwise first set bit + will be first mismatch. */ + bzhil %edx, %eax, %eax + ret + + /* Relatively cold but placing close to L(less_vec) for 2 byte jump + encoding. */ + .p2align 4 +L(one_or_less): + jb L(zero) + movzbl (%rsi), %ecx + movzbl (%rdi), %eax + subl %ecx, %eax + /* No ymm register was touched. */ + ret + /* Within the same 16 byte block is L(one_or_less). */ +L(zero): + xorl %eax, %eax + ret + + .p2align 4 +L(last_2x_vec): + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %YMM1 + vpxorq -(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1 + VMOVU -(VEC_SIZE * 1)(%rsi, %rdx), %YMM2 + vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2 + VPTEST %YMM2, %YMM2, %k1 + kmovd %k1, %eax + ret + + .p2align 4 +L(more_8x_vec): + /* Set end of s1 in rdx. */ + leaq -(VEC_SIZE * 4)(%rdi, %rdx), %rdx + /* rsi stores s2 - s1. This allows loop to only update one pointer. + */ + subq %rdi, %rsi + /* Align s1 pointer. */ + andq $-VEC_SIZE, %rdi + /* Adjust because first 4x vec where check already. */ + subq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(loop_4x_vec): + VMOVU (%rsi, %rdi), %YMM1 + vpxorq (%rdi), %YMM1, %YMM1 + + VMOVU VEC_SIZE(%rsi, %rdi), %YMM2 + vpxorq VEC_SIZE(%rdi), %YMM2, %YMM2 + + VMOVU (VEC_SIZE * 2)(%rsi, %rdi), %YMM3 + vpxorq (VEC_SIZE * 2)(%rdi), %YMM3, %YMM3 + vpternlogd $0xfe, %YMM1, %YMM2, %YMM3 + + VMOVU (VEC_SIZE * 3)(%rsi, %rdi), %YMM4 + vpternlogd $0xde, (VEC_SIZE * 3)(%rdi), %YMM3, %YMM4 + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq2) + subq $-(VEC_SIZE * 4), %rdi + cmpq %rdx, %rdi + jb L(loop_4x_vec) + + subq %rdx, %rdi + VMOVU (VEC_SIZE * 3)(%rsi, %rdx), %YMM4 + vpxorq (VEC_SIZE * 3)(%rdx), %YMM4, %YMM4 + /* rdi has 4 * VEC_SIZE - remaining length. */ + cmpl $(VEC_SIZE * 3), %edi + jae L(8x_last_1x_vec) + /* Load regardless of branch. */ + VMOVU (VEC_SIZE * 2)(%rsi, %rdx), %YMM3 + /* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while oring + with YMM4. Result is stored in YMM4. */ + vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4 + cmpl $(VEC_SIZE * 2), %edi + jae L(8x_last_2x_vec) + + VMOVU VEC_SIZE(%rsi, %rdx), %YMM2 + vpxorq VEC_SIZE(%rdx), %YMM2, %YMM2 + + VMOVU (%rsi, %rdx), %YMM1 + vpxorq (%rdx), %YMM1, %YMM1 + + vpternlogd $0xfe, %YMM1, %YMM2, %YMM4 +L(8x_last_1x_vec): +L(8x_last_2x_vec): + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax +L(return_neq2): + ret + + /* Relatively cold case as page cross are unexpected. */ + .p2align 4 +L(page_cross_less_vec): + cmpl $16, %edx + jae L(between_16_31) + cmpl $8, %edx + ja L(between_9_15) + cmpl $4, %edx + jb L(between_2_3) + /* From 4 to 8 bytes. 
No branch when size == 4. */ + movl (%rdi), %eax + movl (%rsi), %ecx + subl %ecx, %eax + movl -4(%rdi, %rdx), %ecx + movl -4(%rsi, %rdx), %esi + subl %esi, %ecx + orl %ecx, %eax + ret + + .p2align 4,, 8 +L(between_9_15): + /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe. + */ + vmovq (%rdi), %xmm1 + vmovq (%rsi), %xmm2 + vpcmpeqb %xmm1, %xmm2, %xmm3 + vmovq -8(%rdi, %rdx), %xmm1 + vmovq -8(%rsi, %rdx), %xmm2 + vpcmpeqb %xmm1, %xmm2, %xmm2 + vpand %xmm2, %xmm3, %xmm3 + vpmovmskb %xmm3, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_16_31): + /* From 16 to 31 bytes. No branch when size == 16. */ + + /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe. + */ + vmovdqu (%rsi), %xmm1 + vpcmpeqb (%rdi), %xmm1, %xmm1 + vmovdqu -16(%rsi, %rdx), %xmm2 + vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2 + vpand %xmm1, %xmm2, %xmm2 + vpmovmskb %xmm2, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_2_3): + /* From 2 to 3 bytes. No branch when size == 2. */ + movzwl (%rdi), %eax + movzwl (%rsi), %ecx + subl %ecx, %eax + movzbl -1(%rdi, %rdx), %edi + movzbl -1(%rsi, %rdx), %esi + subl %edi, %esi + orl %esi, %eax + /* No ymm register was touched. */ + ret +END (BCMP) +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h index f94516e5ee..51f251d0c9 100644 --- a/sysdeps/x86_64/multiarch/ifunc-bcmp.h +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -35,8 +35,7 @@ IFUNC_SELECTOR (void) && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) { if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) - && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW) - && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) return OPTIMIZE (evex); if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index cda0316928..abbb4e407f 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -52,7 +52,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX512VL) && CPU_FEATURE_USABLE (AVX512BW) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (BMI2)), __bcmp_evex) IFUNC_IMPL_ADD (array, i, bcmp, CPU_FEATURE_USABLE (SSE4_1),