[v3,neleai/string-x64] Improve memcmp performance and fix regression.

On Fri, Jun 19, 2015 at 05:53:04PM +0200, Ondřej Bílka wrote:
> On Thu, Jun 18, 2015 at 10:09:10AM +0200, Ondřej Bílka wrote:
> > Hi,
> > 
> > As a follow-up to the memcmp improvement I submitted in 2013, here is
> > a new version that improves performance a bit more.
> > 
> > Also, when I browsed the results I found that memcmp-sse4 is in fact
> > a regression on the i7 Nehalem, Ivy Bridge and Haswell architectures.
> > There it is beaten by the old sse2 code by more than 10%.
> >

I also tried different headers to see if I could improve the avx2
version. It turned out that the byte-by-byte loop that I use for the
cross-page case is best. If I always use it, it beats the sse4 version
on the gcc workload.

The main problem is that branch misprediction kills performance and I
couldn't find a fast way to make the decision based on n.

> > The main idea of the new implementation is the same. The performance
> > problem was that a lot of inputs were identical with small n.
> > For that case I found that the following approach gives the best
> > performance when n<64 is likely.
> > 
> > if (!cross_page (s1) && !cross_page (s2))
> >   {
> >     mask = get_mask(EQ(EQ(LOAD(s1),LOAD(s2)),zero));
> >     mask2 = mask & ((2 << (n-1)) - 1);
> >     if (mask2)
> >       return s1[first_byte(mask2)]-s2[first_byte(mask2)];
> >     if (n<=16)
> >       return 0;
> >     mask |= get_mask(EQ(EQ(LOAD(s1+16),LOAD(s2+16)),zero)) << 16;
> >     mask |= get_mask(EQ(EQ(LOAD(s1+32),LOAD(s2+32)),zero)) << 32;
> >     mask |= get_mask(EQ(EQ(LOAD(s1+48),LOAD(s2+48)),zero)) << 48;
> >     mask2 = mask & ((2 << (n-1)) - 1);
> >     if (mask2)
> >       return s1[first_byte(mask2)]-s2[first_byte(mask2)];
> >     if (n<=64)                        
> >       return 0;
> >     if (mask)
> >       return s1[first_byte(mask)]-s2[first_byte(mask)];
> >   }
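
For illustration, here is a minimal C sketch of the first step of the
pseudocode above, restricted to 1 <= n <= 16. It is not the code from the
patch, which is written in assembly. The names cross_page16, diff_mask16
and memcmp_upto16 are hypothetical, the 16-byte page margin is an
assumption, and __builtin_ctz assumes GCC or Clang. Both inputs must have
at least 16 readable bytes unless they end near a page boundary, where
the byte loop is used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: nonzero if a 16-byte load starting at p would
   touch the next 4096-byte page.  */
static int cross_page16(const void *p)
{
    return ((uintptr_t)p & 4095) > 4096 - 16;
}

/* Bit i of the result is set iff s1[i] != s2[i], for i in 0..15.
   Always reads 16 bytes from each input.  */
static unsigned diff_mask16(const unsigned char *s1, const unsigned char *s2)
{
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128((const __m128i *)s1);
    __m128i b = _mm_loadu_si128((const __m128i *)s2);
    /* movemask yields a bit per EQUAL byte, so invert the low 16 bits.  */
    unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
    return ~eq & 0xffffu;
#else
    /* Portable stand-in for the vector compare.  */
    unsigned m = 0;
    for (int i = 0; i < 16; i++)
        m |= (unsigned)(s1[i] != s2[i]) << i;
    return m;
#endif
}

/* memcmp for 1 <= n <= 16 only.  */
static int memcmp_upto16(const void *p1, const void *p2, size_t n)
{
    const unsigned char *s1 = p1, *s2 = p2;
    assert(n >= 1 && n <= 16);

    if (!cross_page16(s1) && !cross_page16(s2)) {
        /* Keep only the difference bits below n.  */
        unsigned mask = diff_mask16(s1, s2) & ((2u << (n - 1)) - 1);
        if (mask == 0)
            return 0;
        int i = __builtin_ctz(mask);   /* index of first differing byte */
        return s1[i] - s2[i];
    }

    /* Near a page end: compare byte by byte, never reading past n.  */
    for (size_t i = 0; i < n; i++)
        if (s1[i] != s2[i])
            return s1[i] - s2[i];
    return 0;
}
```

Writing the mask as (2 << (n-1)) - 1 rather than (1 << n) - 1 keeps the
shift count below the register width even when n equals the full width.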
> > 
> > I haven't yet checked using just registers and byteswap to eliminate
> > the need to get the exact byte position, as I wrote in a related
> > thread.
> > 
> > I could improve this a bit more; I lose a lot of cycles in the
> > loop-ending conditions. The problem is that I need to handle an
> > unaligned s2 that may read from the next page, so I would need to add
> > more complicated logic to compute the number of loop iterations.
> > 
> > That is related to avx2. I included it as an RFC, but it harms
> > performance on Haswell.
> > 
> > Last is wmemcmp, which I would also need to convert; for now I just
> > moved memcmp-sse4 there.
> > 
> > A profile can be found here.
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_profile.html
> > 
> I updated the new version. I removed avx2 for now; I will submit it
> when I find out how it could improve performance.
> 
> The second change is that I added wmemcmp conditionals, so now I can
> delete memcmp-sse4 and wmemcmp-sse4.
> 
> 
After finding the bts trick for strncmp I also tried to use it in
memcmp. The problem is that in memcmp my previous control flow was
better: for memcmp it is likely that the arguments are equal, so I save
the cost of bsf and of comparing bytes.

The only improvement was that using bts with the same control flow saves
a few cycles, giving around a 2% improvement on the gcc workload.
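
As a rough illustration of what a bts-based length check can look like,
here is a C sketch. It reflects one plausible reading of the trick and
is not taken from the patch: bit n is set in the difference mask as a
sentinel (the job of bts), so a single bit scan (the job of bsf) either
finds a real difference below n or stops at n itself. The names
diff_mask and cmp_with_sentinel are hypothetical, the byte loop in
diff_mask merely stands in for the vector compare, and __builtin_ctzll
assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the vector compare: bit i set iff s1[i] != s2[i].
   len must be at most 64.  */
static uint64_t diff_mask(const unsigned char *s1, const unsigned char *s2,
                          size_t len)
{
    uint64_t m = 0;
    for (size_t i = 0; i < len; i++)
        m |= (uint64_t)(s1[i] != s2[i]) << i;
    return m;
}

/* Turn a difference mask into a memcmp result for n < 64 bytes.  */
static int cmp_with_sentinel(const unsigned char *s1, const unsigned char *s2,
                             uint64_t mask, size_t n)
{
    assert(n < 64);
    mask |= (uint64_t)1 << n;                   /* sentinel bit, as bts would set */
    size_t i = (size_t)__builtin_ctzll(mask);   /* lowest set bit, as bsf finds */
    if (i == n)                                 /* no difference below n */
        return 0;
    return s1[i] - s2[i];
}
```

With the sentinel in place the mask is never zero, so the bit scan is
always well defined and the n check becomes a single comparison.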

Also, in the cross-page case the only optimization was to unroll the
byte-by-byte loop, as switching to a bigger comparison caused more
overhead than it saved.
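
For readers who want to picture the cross-page fallback, here is a
minimal C sketch of an unrolled byte-by-byte comparison. The unroll
factor of 4 and the name memcmp_bytewise_unrolled are assumptions for
illustration; the patch itself does this in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-by-byte memcmp, manually unrolled four times.  It never reads
   past n bytes, so it is safe next to an unmapped page.  */
static int memcmp_bytewise_unrolled(const void *p1, const void *p2, size_t n)
{
    const unsigned char *s1 = p1, *s2 = p2;
    size_t i = 0;
    assert(n == 0 || (p1 != NULL && p2 != NULL));

    /* Main loop: four bytes per iteration, one loop branch per four.  */
    for (; i + 4 <= n; i += 4) {
        if (s1[i] != s2[i])
            return s1[i] - s2[i];
        if (s1[i + 1] != s2[i + 1])
            return s1[i + 1] - s2[i + 1];
        if (s1[i + 2] != s2[i + 2])
            return s1[i + 2] - s2[i + 2];
        if (s1[i + 3] != s2[i + 3])
            return s1[i + 3] - s2[i + 3];
    }

    /* Tail: at most three remaining bytes.  */
    for (; i < n; i++)
        if (s1[i] != s2[i])
            return s1[i] - s2[i];
    return 0;
}
```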

So what about the following version?

	* sysdeps/x86_64/memcmp.S: New implementation.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Remove memcmp-sse4.
	* sysdeps/x86_64/multiarch/Makefile (routines): Remove memcmp-sse4.
	* sysdeps/x86_64/multiarch/memcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp-sse4.S: Removed.
	* sysdeps/x86_64/multiarch/wmemcmp-sse4.S: Likewise.
> 

---
 sysdeps/x86_64/memcmp.S                          |  512 +++----
 sysdeps/x86_64/multiarch/Makefile                |    6 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c       |    9 +-
 sysdeps/x86_64/multiarch/memcmp-avx2.S           |    3 +
 sysdeps/x86_64/multiarch/memcmp-sse4.S           | 1776 ----------------------
 sysdeps/x86_64/multiarch/memcmp.S                |   25 +-
 sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S |    9 +-
 sysdeps/x86_64/multiarch/wmemcmp-sse4.S          |    4 -
 sysdeps/x86_64/multiarch/wmemcmp.S               |   12 +-
 9 files changed, 221 insertions(+), 2135 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memcmp-avx2.S
 delete mode 100644 sysdeps/x86_64/multiarch/memcmp-sse4.S
 delete mode 100644 sysdeps/x86_64/multiarch/wmemcmp-sse4.S
