Message ID | CAGQ9bdw=-pzbi00gn5t_W8pXBjU0edFvLGMNUc+o=g9og=f9+Q@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Tue, May 20, 2014 at 06:23:40PM +0400, Konstantin Serebryany wrote: > Do you think such a benchmarks is possible at all? > I've made a simple test (attached) and profiled it with "perf": > 85.26% a.out libcrypt-2.15.so [.] _ufc_doit_r > 6.13% a.out libcrypt-2.15.so [.] _ufc_mk_keytab_r > 2.40% a.out libcrypt-2.15.so [.] _ufc_setup_salt_r > 1.56% a.out libcrypt-2.15.so [.] __crypt_r > 1.51% a.out libcrypt-2.15.so [.] _ufc_output_conversion_r > 1.35% a.out libcrypt-2.15.so [.] crypt > > As you can see, crypt_r itself takes a tiny fraction of time, > most of it is spent in _ufc_* which are defined in another module. > Any changes in crypt itself (unless you do something insane) will not > be observable in profile. You'll need to choose inputs so that __md5_crypt_r is called - crypt/crypt-entry.c should help you with that. if __md5_crypt_r takes a tiny fraction of time again, then this change should be safe, but I would prefer that the inputs get fed into the benchtests anyway so that we can automatically track performance of crypt for those inputs in future. Siddhesh
On Tue, 20 May 2014, Siddhesh Poyarekar wrote: > On Tue, May 20, 2014 at 06:23:40PM +0400, Konstantin Serebryany wrote: > > Do you think such a benchmarks is possible at all? > > I've made a simple test (attached) and profiled it with "perf": > > 85.26% a.out libcrypt-2.15.so [.] _ufc_doit_r > > 6.13% a.out libcrypt-2.15.so [.] _ufc_mk_keytab_r > > 2.40% a.out libcrypt-2.15.so [.] _ufc_setup_salt_r > > 1.56% a.out libcrypt-2.15.so [.] __crypt_r > > 1.51% a.out libcrypt-2.15.so [.] _ufc_output_conversion_r > > 1.35% a.out libcrypt-2.15.so [.] crypt > > > > As you can see, crypt_r itself takes a tiny fraction of time, > > most of it is spent in _ufc_* which are defined in another module. > > Any changes in crypt itself (unless you do something insane) will not > > be observable in profile. > > You'll need to choose inputs so that __md5_crypt_r is called - > crypt/crypt-entry.c should help you with that. if __md5_crypt_r takes > a tiny fraction of time again, then this change should be safe, but I > would prefer that the inputs get fed into the benchtests anyway so > that we can automatically track performance of crypt for those inputs > in future. Another thing that can be done is to look at the code generated for the relevant file before and after the patch - I'd expect very little change.
On Tue, May 20, 2014 at 02:54:41PM +0000, Joseph S. Myers wrote: > On Tue, 20 May 2014, Siddhesh Poyarekar wrote: > > > On Tue, May 20, 2014 at 06:23:40PM +0400, Konstantin Serebryany wrote: > > > Do you think such a benchmarks is possible at all? > > > I've made a simple test (attached) and profiled it with "perf": > > > 85.26% a.out libcrypt-2.15.so [.] _ufc_doit_r > > > 6.13% a.out libcrypt-2.15.so [.] _ufc_mk_keytab_r > > > 2.40% a.out libcrypt-2.15.so [.] _ufc_setup_salt_r > > > 1.56% a.out libcrypt-2.15.so [.] __crypt_r > > > 1.51% a.out libcrypt-2.15.so [.] _ufc_output_conversion_r > > > 1.35% a.out libcrypt-2.15.so [.] crypt > > > > > > As you can see, crypt_r itself takes a tiny fraction of time, > > > most of it is spent in _ufc_* which are defined in another module. > > > Any changes in crypt itself (unless you do something insane) will not > > > be observable in profile. > > > > You'll need to choose inputs so that __md5_crypt_r is called - > > crypt/crypt-entry.c should help you with that. if __md5_crypt_r takes > > a tiny fraction of time again, then this change should be safe, but I > > would prefer that the inputs get fed into the benchtests anyway so > > that we can automatically track performance of crypt for those inputs > > in future. > > Another thing that can be done is to look at the code generated for the > relevant file before and after the patch - I'd expect very little change. > Yes I wanted to write that benchmarking this change is probably pointless as gcc should optimize them in same way (if there is consistently difference in one way we could report that as gcc regression.) Also perf data show that its a cold function so you should improve readability not performance (measuring difference is possible but you need to run that for few days and wait until standard deviation drops sufficiently to get statistically significant result). About only way this could make difference is if when it gets inlined in one way and does not in other. That is out of scope of microbenchmarks, instruction cache misses easily could make a function slower in wild but you would measure that inlined function is faster when you would call it in tight loop. As actual change is concerned
On Tue, May 20, 2014 at 6:46 PM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote: > On Tue, May 20, 2014 at 06:23:40PM +0400, Konstantin Serebryany wrote: >> Do you think such a benchmarks is possible at all? >> I've made a simple test (attached) and profiled it with "perf": >> 85.26% a.out libcrypt-2.15.so [.] _ufc_doit_r >> 6.13% a.out libcrypt-2.15.so [.] _ufc_mk_keytab_r >> 2.40% a.out libcrypt-2.15.so [.] _ufc_setup_salt_r >> 1.56% a.out libcrypt-2.15.so [.] __crypt_r >> 1.51% a.out libcrypt-2.15.so [.] _ufc_output_conversion_r >> 1.35% a.out libcrypt-2.15.so [.] crypt >> >> As you can see, crypt_r itself takes a tiny fraction of time, >> most of it is spent in _ufc_* which are defined in another module. >> Any changes in crypt itself (unless you do something insane) will not >> be observable in profile. > > You'll need to choose inputs so that __md5_crypt_r is called - > crypt/crypt-entry.c should help you with that. With salt == "$1$" I get this profile: 84.70% a.out libcrypt-2.15.so [.] __md5_process_block 5.98% a.out libc-2.15.so [.] __memcpy_ssse3_back 4.71% a.out libcrypt-2.15.so [.] __md5_process_bytes 1.92% a.out libcrypt-2.15.so [.] __md5_crypt_r 1.62% a.out libcrypt-2.15.so [.] __md5_finish_ctx Again, __md5_crypt_r is within noise. >> Another thing that can be done is to look at the code generated for the >> relevant file before and after the patch - I'd expect very little change. I like this method much more, but there is actually a change. The original binary has calls to b64_from_24bit.7858 (the nested function), while in the binary built with my patch these calls are inlined. Of course, this is a feature of a particular version of GCC, not of the glibc code. > if __md5_crypt_r takes > a tiny fraction of time again, then this change should be safe, but I > would prefer that the inputs get fed into the benchtests anyway so > that we can automatically track performance of crypt for those inputs > in future. > > Siddhesh
On Tue, May 20, 2014 at 05:20:31PM +0200, Ondřej Bílka wrote: > Yes I wanted to write that benchmarking this change is probably > pointless as gcc should optimize them in same way (if there is > consistently difference in one way we could report that as gcc regression.) > > Also perf data show that its a cold function so you should improve > readability not performance (measuring difference is possible but you > need to run that for few days and wait until standard deviation drops > sufficiently to get statistically significant result). > > About only way this could make difference is if when it gets inlined in > one way and does not in other. > That is out of scope of microbenchmarks, instruction cache misses easily > could make a function slower in wild but you would measure that inlined > function is faster when you would call it in tight loop. Fair enough. I'll push the patch if Konstantin shows that the generated code is not drastically different. Siddhesh
On Tue, May 20, 2014 at 7:27 PM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote: > On Tue, May 20, 2014 at 05:20:31PM +0200, Ondřej Bílka wrote: >> Yes I wanted to write that benchmarking this change is probably >> pointless as gcc should optimize them in same way (if there is >> consistently difference in one way we could report that as gcc regression.) >> >> Also perf data show that its a cold function so you should improve >> readability not performance (measuring difference is possible but you >> need to run that for few days and wait until standard deviation drops >> sufficiently to get statistically significant result). >> >> About only way this could make difference is if when it gets inlined in >> one way and does not in other. >> That is out of scope of microbenchmarks, instruction cache misses easily >> could make a function slower in wild but you would measure that inlined >> function is faster when you would call it in tight loop. > > Fair enough. I'll push the patch if Konstantin shows that the > generated code is not drastically different. Just a data point; I am not sure if it proves anything. If I add __attribute__ ((noinline)) to the definition of b64_from_24bit, the two binaries become more similar. The non-nested variant is a bit longer as it needs 6 instructions to pass the parameters: 15a9: 0f b6 8d 87 fe ff ff movzbl -0x179(%rbp),%ecx 15b0: 0f b6 95 81 fe ff ff movzbl -0x17f(%rbp),%edx 15b7: 41 b9 04 00 00 00 mov $0x4,%r9d 15bd: 44 0f b6 85 8d fe ff movzbl -0x173(%rbp),%r8d 15c4: ff 15c5: 48 89 de mov %rbx,%rsi 15c8: 4c 89 f7 mov %r14,%rdi 15cb: e8 40 fb ff ff callq 1110 <b64_from_24bit> The nested variant needs 5: 15bf: 0f b6 95 7e fe ff ff movzbl -0x182(%rbp),%edx 15c6: 0f b6 b5 78 fe ff ff movzbl -0x188(%rbp),%esi 15cd: 49 89 da mov %rbx,%r10 15d0: 0f b6 bd 72 fe ff ff movzbl -0x18e(%rbp),%edi 15d7: b9 04 00 00 00 mov $0x4,%ecx 15dc: e8 2f fb ff ff callq 1110 <b64_from_24bit.7858> With inlining (that happens for me only in the non-nested variant) this is irrelevant. > > Siddhesh
On Tue, May 20, 2014 at 07:25:52PM +0400, Konstantin Serebryany wrote: > With salt == "$1$" I get this profile: > 84.70% a.out libcrypt-2.15.so [.] __md5_process_block > 5.98% a.out libc-2.15.so [.] __memcpy_ssse3_back > 4.71% a.out libcrypt-2.15.so [.] __md5_process_bytes > 1.92% a.out libcrypt-2.15.so [.] __md5_crypt_r > 1.62% a.out libcrypt-2.15.so [.] __md5_finish_ctx > > Again, __md5_crypt_r is within noise. That's good. > > >> Another thing that can be done is to look at the code generated for the > >> relevant file before and after the patch - I'd expect very little change. > > I like this method much more, but there is actually a change. > The original binary has calls to b64_from_24bit.7858 (the nested function), > while in the binary built with my patch these calls are inlined. > Of course, this is a feature of a particular version of GCC, not of > the glibc code. Given that this does not have a very big impact, I'm inclined to accepting this change. I'll push it in on Thursday after noon UTC if nobody objects. Siddhesh
On Tue, May 20, 2014 at 7:46 PM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote: > On Tue, May 20, 2014 at 07:25:52PM +0400, Konstantin Serebryany wrote: >> With salt == "$1$" I get this profile: >> 84.70% a.out libcrypt-2.15.so [.] __md5_process_block >> 5.98% a.out libc-2.15.so [.] __memcpy_ssse3_back >> 4.71% a.out libcrypt-2.15.so [.] __md5_process_bytes >> 1.92% a.out libcrypt-2.15.so [.] __md5_crypt_r >> 1.62% a.out libcrypt-2.15.so [.] __md5_finish_ctx >> >> Again, __md5_crypt_r is within noise. > > That's good. > >> >> >> Another thing that can be done is to look at the code generated for the >> >> relevant file before and after the patch - I'd expect very little change. >> >> I like this method much more, but there is actually a change. >> The original binary has calls to b64_from_24bit.7858 (the nested function), >> while in the binary built with my patch these calls are inlined. >> Of course, this is a feature of a particular version of GCC, not of >> the glibc code. > > Given that this does not have a very big impact, I'm inclined to > accepting this change. I'll push it in on Thursday after noon UTC if > nobody objects. I realized that two other files have the same situation with the function b64_from_24bit. md5-crypt.c, sha256-crypt.c, and sha512-crypt.c have exactly the same implementations of b64_from_24bit as a nested function. One possible fix is to move those two out and make them regular static functions, just like I did in md5-crypt.c. Another solution is to create just one implementation of b64_from_24bit and place it somewhere else. For example declare it as __b64_from_24bit in crypt-private.h and implement it in crypt_util.c. WDYT?
On Wed, May 21, 2014 at 03:35:43PM +0400, Konstantin Serebryany wrote: > I realized that two other files have the same situation with the > function b64_from_24bit. > md5-crypt.c, sha256-crypt.c, and sha512-crypt.c have exactly the same > implementations of b64_from_24bit as a nested function. > One possible fix is to move those two out and make them regular static > functions, just like I did in md5-crypt.c. > > Another solution is to create just one implementation of > b64_from_24bit and place it somewhere else. > For example declare it as __b64_from_24bit in crypt-private.h and > implement it in crypt_util.c. > > WDYT? One implementation would obviously be better. Siddhesh
On Wed, May 21, 2014 at 3:47 PM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote: > On Wed, May 21, 2014 at 03:35:43PM +0400, Konstantin Serebryany wrote: >> I realized that two other files have the same situation with the >> function b64_from_24bit. >> md5-crypt.c, sha256-crypt.c, and sha512-crypt.c have exactly the same >> implementations of b64_from_24bit as a nested function. >> One possible fix is to move those two out and make them regular static >> functions, just like I did in md5-crypt.c. >> >> Another solution is to create just one implementation of >> b64_from_24bit and place it somewhere else. >> For example declare it as __b64_from_24bit in crypt-private.h and >> implement it in crypt_util.c. >> >> WDYT? > > One implementation would obviously be better. Let me prepare another patch then. Does __b64_from_24bit declared in crypt-private.h and defined in crypt_util.c sound good? > > Siddhesh
On Wed, May 21, 2014 at 03:48:32PM +0400, Konstantin Serebryany wrote: > Let me prepare another patch then. Let this patch go in as is. Move the function out with the patch to sha256-crypt.c or whatever you're changing next, assuming that this patch is in. > Does __b64_from_24bit declared in crypt-private.h and defined in > crypt_util.c sound good? Yes. Siddhesh
On Tue, May 20, 2014 at 06:23:40PM +0400, Konstantin Serebryany wrote: > 2014-05-20 Kostya Serebryany <konstantin.s.serebryany@gmail.com> > > * crypt/md5-crypt.c (__md5_crypt_r): Remove a nested function. > (b64_from_24bit): New function. > I have pushed this (and the formatting fix) now. Siddhesh
diff --git a/crypt/md5-crypt.c b/crypt/md5-crypt.c index d1b92d7..5d5fc55 100644 --- a/crypt/md5-crypt.c +++ b/crypt/md5-crypt.c @@ -88,6 +88,20 @@ extern char *__md5_crypt_r (const char *key, const char *salt, char *buffer, int buflen); extern char *__md5_crypt (const char *key, const char *salt); +static void +b64_from_24bit (char **cp, int *buflen, + unsigned int b2, unsigned int b1, unsigned int b0, + int n) +{ + unsigned int w = (b2 << 16) | (b1 << 8) | b0; + while (n-- > 0 && *buflen > 0) + { + *(*cp)++ = b64t[w & 0x3f]; + --*buflen; + w >>= 6; + } +} + /* This entry point is equivalent to the `crypt' function in Unix