Message ID | 20150717004134.GA30504@domone |
---|---|
State | New |
Headers | show |
On 07/17/2015 06:11 AM, Ondřej Bílka wrote: > On Thu, Jul 16, 2015 at 06:05:10PM -0300, Tulio Magno Quites Machado Filho wrote: >> "Carlos O'Donell" <carlos@redhat.com> writes: >> >>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote: >>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at >>>> all. I did that as test how much we could test IBM to verify patches. >> >> No. That isn't true. >> I reviewed this patch myself. All the 4 versions. >> So did Steven. >> > I did see only three. Also I didn't see you mails with comments on > libc-alpha. Please post these with comments so everybody could see > these. > > If I were you then I would deny that I reviewed it as that cast bad > light on you. There are numerous flaws on these patches and that both of > you didn't catch bug shows it. > >>>> I pointed out that it could have possibly quadratic behaviour which >>>> still does. So please don't accept unreviewed patches next time. >> >> Yes, this patch does have a quadratic worst case as does quicksort. >> But for both algorithms, they're clearly an improvement over what was >> available before them. >> Is this algorithm perfect? No. I don't believe such a thing exist. I >> believe in iterative improvements. >> > That isn't clear improvement. As patch history versions 1 and 2 were bad > algorithm which took me lot of time to convince change. Then this is > better but its just my patch for that poorly rewriten. I clearly see > problems there, for example calling strnlen is unnecessary overhead. > > One thing is iterative change, second is wasted effort. If code needs to > be completely replaced to improve performance then its better not to > send a patch at all. If a assembly is slower than equivalent c > impementation then its pointless to use assembly. So why should you keep > custom assembly when generic implementation is better? > >>> They showed cases for which the code does go faster and objectively >>> so using the microbenchmark, and that's a win for now. Please continue >>> to work with IBM to remove the quadratic worst case. >> >> That's our intention. >> An improvement to this implementation is already in our backlog (i.e. we >> haven't started yet) and I'd be glad if someone is interested to help us with >> this. >> > That is simple as my generic implementation is both simpler and faster > and handles this. If you didn't waste time with assembly then you > wouldn't have backlog. > > >>> Tulio, You will need to work out why you have quadratic worst case. >>> It's certainly something we try to avoid. Did you make a particular >>> decision not to avoid it? >> >> IMHO, the overall improvement brought by this patch is more important than >> the worst case scenario Ondřej is showing. >> Which doesn't mean we're neglecting the quadratic worst case. >> > As this was also negatively received on x64, see bug 12100 pointing out > to fix that should every reviewer do. > > And as generic case I many times asked if you ran my generic strstr > patch. It was lot faster than first two iteration and in practice looks I have run benchtest against your generic strstr patches and the results are attached in https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html where 12750 out of 13442 combinations shows improvement. > twice faster than this assembly. Thats because strings are generaly > short. Its bit slower than assembly or large size. Thats just caused by > using ppc strchr instead power7 one which is slower. So I just used two > versions. > > Results with modified benchmark attached, where I did hack to rename my > proposal as strstr_power7b while strstr_power7 is current to clearly see > difference. > How did you manage to get both __strstr_power7b and __strstr_power7 results in the same benchtest run? or Did you run two different benchtests and merge the output? > So if you want better performance and fix quadratic behaviour what about > following? > > * benchtests/bench-strstr.c: Modify benchmark. > * sysdeps/powerpc/powerpc64/multiarch/Makefile (routines): Add strstr-ppc64. > * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c > (strstr_ppc, strstr_power7: Add. > * sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c: New file. > * sysdeps/powerpc/powerpc64/multiarch/strstr.c: Likewise. >
On Fri, Jul 17, 2015 at 04:14:53PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > On 07/17/2015 06:11 AM, Ondřej Bílka wrote: > >On Thu, Jul 16, 2015 at 06:05:10PM -0300, Tulio Magno Quites Machado Filho wrote: > > >And as generic case I many times asked if you ran my generic strstr > >patch. It was lot faster than first two iteration and in practice looks > I have run benchtest against your generic strstr patches > and the results are attached in > https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html > where 12750 out of 13442 combinations shows improvement. > And I was repeatedly telling you that your benchmark is invalid. You compare generic digraph search with power7 strchr. That has two problems. First one is that I previously described that reason of using digraphs instead of just strchr loop is better performance in more degenerated cases. Your benchmarks just shows that checking for one character is better than for two when probability of that character is very low. So to test these you need to use different benchmark where you set probabilities of first character to 1/8, 1/16, 1/32, 1/64 and you will see that on some point digraphs beat single character. From my previous mail: " Just use same patch like I send with ((unsigned) rand())%16 + 1 and you will see completely different numbers in benchmark. This is real problem, not just pathological case as it could easily happen in some programs. If you search english text for strstr(haystack," the") then on average every fourth character is space. " As second problem I was repeatedly telling you that you need compare power7 optimized code with power7 one. So either add powerpc instructions to skeleton or use my previous patch. As I said that my patch would beat yours it holds as that applied to strstr v2. Then I realized this problem and said that you should check naive loop which you didn't do. Following comment from my previous mail still applies: " My comment was that it wastes time most of time as in common case these already differ in first byte so all these shifts were completely unnecessary, a simple optimization of naive algorithm using while(s=strchr(s,n[0])) would be faster. " Which I also paraphrased here which applies to your v3. " Second is about average case perforance. There worst case is irrelevant as you use buy-or-rent techniques to handle it. I sumbitted following patch with easy way to handle that. [PATCH] Improve generic strstr performance. its currently superseeded by generic vector bigraph search but it should beat your strstr. " > >twice faster than this assembly. Thats because strings are generaly > >short. Its bit slower than assembly or large size. Thats just caused by > >using ppc strchr instead power7 one which is slower. So I just used two > >versions. > > > >Results with modified benchmark attached, where I did hack to rename my > >proposal as strstr_power7b while strstr_power7 is current to clearly see > >difference. > > > How did you manage to get both __strstr_power7b and __strstr_power7 > results in the same benchtest run? or Did you run two different > benchtests and merge the output? > Trivial. 1. Just add new implementation after patching without deleting old one, name it __strstr_power7b 2. Add __strstr_power7b to ifunc_impl_list 3. Run benchmark
On 17-07-2015 08:53, Ondřej Bílka wrote: > On Fri, Jul 17, 2015 at 04:14:53PM +0530, Rajalakshmi Srinivasaraghavan wrote: >> >> >> On 07/17/2015 06:11 AM, Ondřej Bílka wrote: >>> On Thu, Jul 16, 2015 at 06:05:10PM -0300, Tulio Magno Quites Machado Filho wrote: > > >>> And as generic case I many times asked if you ran my generic strstr >>> patch. It was lot faster than first two iteration and in practice looks >> I have run benchtest against your generic strstr patches >> and the results are attached in >> https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html >> where 12750 out of 13442 combinations shows improvement. >> > And I was repeatedly telling you that your benchmark is invalid. You > compare generic digraph search with power7 strchr. That has two > problems. > > First one is that I previously described that reason of using digraphs instead > of just strchr loop is better performance in more degenerated cases. > Your benchmarks just shows that checking for one character is better > than for two when probability of that character is very low. > > So to test these you need to use different benchmark where you set > probabilities of first character to 1/8, 1/16, 1/32, 1/64 and you will > see that on some point digraphs beat single character. > > From my previous mail: > > " > Just use same patch like I send with ((unsigned) rand())%16 + 1 and you > will see completely different numbers in benchmark. > > This is real problem, not just pathological case as it could easily > happen in some programs. If you search english text for > strstr(haystack," the") then on average every fourth character is space. > " > > As second problem I was repeatedly telling you that you need compare > power7 optimized code with power7 one. So either add powerpc > instructions to skeleton or use my previous patch. As I said that my > patch would beat yours it holds as that applied to strstr v2. Then I > realized this problem and said that you should check naive loop which > you didn't do. > > Following comment from my previous mail still applies: > > " > My comment was that it wastes time most of time as in common case these > already differ in first byte so all these shifts were completely > unnecessary, a simple optimization of naive algorithm using > > while(s=strchr(s,n[0])) > > would be faster. > " > > Which I also paraphrased here which applies to your v3. > > " > Second is about average case perforance. There worst case is irrelevant > as you use buy-or-rent techniques to handle it. I sumbitted following > patch with easy way to handle that. > > [PATCH] Improve generic strstr performance. > > its currently superseeded by generic vector bigraph search but it should > beat your strstr. > " > >>> twice faster than this assembly. Thats because strings are generaly >>> short. Its bit slower than assembly or large size. Thats just caused by >>> using ppc strchr instead power7 one which is slower. So I just used two >>> versions. >>> >>> Results with modified benchmark attached, where I did hack to rename my >>> proposal as strstr_power7b while strstr_power7 is current to clearly see >>> difference. >>> >> How did you manage to get both __strstr_power7b and __strstr_power7 >> results in the same benchtest run? or Did you run two different >> benchtests and merge the output? >> > Trivial. > 1. Just add new implementation after patching without deleting old one, name it __strstr_power7b > 2. Add __strstr_power7b to ifunc_impl_list > 3. Run benchmark > I would be very good if we could work on improve the strstr benchmark to shows this issues you just raised.
On Fri, Jul 17, 2015 at 09:13:59AM -0300, Adhemerval Zanella wrote: > > > On 17-07-2015 08:53, Ondřej Bílka wrote: > > On Fri, Jul 17, 2015 at 04:14:53PM +0530, Rajalakshmi Srinivasaraghavan wrote: > >> > >> > >> On 07/17/2015 06:11 AM, Ondřej Bílka wrote: > >>> On Thu, Jul 16, 2015 at 06:05:10PM -0300, Tulio Magno Quites Machado Filho wrote: > > > > >>> And as generic case I many times asked if you ran my generic strstr > >>> patch. It was lot faster than first two iteration and in practice looks > >> I have run benchtest against your generic strstr patches > >> and the results are attached in > >> https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html > >> where 12750 out of 13442 combinations shows improvement. > >> > > And I was repeatedly telling you that your benchmark is invalid. You > > compare generic digraph search with power7 strchr. That has two > > problems. > > > > First one is that I previously described that reason of using digraphs instead > > of just strchr loop is better performance in more degenerated cases. > > Your benchmarks just shows that checking for one character is better > > than for two when probability of that character is very low. > > > > So to test these you need to use different benchmark where you set > > probabilities of first character to 1/8, 1/16, 1/32, 1/64 and you will > > see that on some point digraphs beat single character. > > > > From my previous mail: > > > > " > > Just use same patch like I send with ((unsigned) rand())%16 + 1 and you > > will see completely different numbers in benchmark. > > > > This is real problem, not just pathological case as it could easily > > happen in some programs. If you search english text for > > strstr(haystack," the") then on average every fourth character is space. > > " > > > > As second problem I was repeatedly telling you that you need compare > > power7 optimized code with power7 one. So either add powerpc > > instructions to skeleton or use my previous patch. As I said that my > > patch would beat yours it holds as that applied to strstr v2. Then I > > realized this problem and said that you should check naive loop which > > you didn't do. > > > > Following comment from my previous mail still applies: > > > > " > > My comment was that it wastes time most of time as in common case these > > already differ in first byte so all these shifts were completely > > unnecessary, a simple optimization of naive algorithm using > > > > while(s=strchr(s,n[0])) > > > > would be faster. > > " > > > > Which I also paraphrased here which applies to your v3. > > > > " > > Second is about average case perforance. There worst case is irrelevant > > as you use buy-or-rent techniques to handle it. I sumbitted following > > patch with easy way to handle that. > > > > [PATCH] Improve generic strstr performance. > > > > its currently superseeded by generic vector bigraph search but it should > > beat your strstr. > > " > > > >>> twice faster than this assembly. Thats because strings are generaly > >>> short. Its bit slower than assembly or large size. Thats just caused by > >>> using ppc strchr instead power7 one which is slower. So I just used two > >>> versions. > >>> > >>> Results with modified benchmark attached, where I did hack to rename my > >>> proposal as strstr_power7b while strstr_power7 is current to clearly see > >>> difference. > >>> > >> How did you manage to get both __strstr_power7b and __strstr_power7 > >> results in the same benchtest run? or Did you run two different > >> benchtests and merge the output? > >> > > Trivial. > > 1. Just add new implementation after patching without deleting old one, name it __strstr_power7b > > 2. Add __strstr_power7b to ifunc_impl_list > > 3. Run benchmark > > > > I would be very good if we could work on improve the strstr benchmark > to shows this issues you just raised. For quadratic case I already send patch in thread. For other issues microbenchmark isn't sufficient as usual. These inputs are not completely random. While I could with lot of effort model these as markov chains it would be wasted work as getting trace is much simpler and more reliable.
diff --git a/benchtests/bench-strstr.c b/benchtests/bench-strstr.c index 74f3ee8..b0904e7 100644 --- a/benchtests/bench-strstr.c +++ b/benchtests/bench-strstr.c @@ -91,24 +91,15 @@ do_test (size_t align1, size_t align2, size_t len1, size_t len2, } s2[len2] = '\0'; - if (fail) - { - char *ss1 = s1; - for (size_t l = len1; l > 0; l = l > dl ? l - dl : 0) - { - size_t t = l > dl ? dl : l; - memcpy (ss1, d, t); - ++ss1[len2 > 7 ? 7 : len2 - 1]; - ss1 += t; - } - } - else - { - memset (s1, '0', len1); - memcpy (s1 + len1 - len2, s2, len2); - } + int i; + for (i=0; i < len1; i++) + s1[i] = (((unsigned)rand()) % 128) + 17; s1[len1] = '\0'; + for (i=0; i < len2; i++) + s2[i]= (((unsigned) rand()) % 128) + 17; + s2[len2] = '\0'; + printf ("Length %4zd/%zd, alignment %2zd/%2zd, %s:", len1, len2, align1, align2, fail ? "fail" : "found"); @@ -128,6 +119,23 @@ test_main (void) printf ("\t%s", impl->name); putchar ('\n'); + + char s1[1000000],s2[2000]; + + memset (s1, 'a', 1000000); + memset (s2, 'a', 2000); + s1[9999] = '\0'; + s2[1998] = 'b'; + s2[1999] = '\0'; + + { + printf ("strstr(\"aaa...aaa\",\"aaa...aaab\")\n"); + FOR_EACH_IMPL (impl, 0) + do_one_test (impl, s1, s2, NULL); + putchar('\n'); + } + + for (size_t klen = 2; klen < 32; ++klen) for (size_t hlen = 2 * klen; hlen < 16 * klen; hlen += klen) { diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile index 17265bd..b92edbf 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile @@ -19,7 +19,7 @@ sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \ strcmp-power8 strcmp-power7 strcmp-ppc64 \ strcat-power8 strcat-power7 strcat-ppc64 \ memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \ - strncpy-power8 + strncpy-power8 strstr-ppc64 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c index f5fdea5..364385b 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c @@ -322,5 +322,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL_ADD (array, i, strcat, 1, __strcat_ppc)) + /* Support sysdeps/powerpc/powerpc64/multiarch/strstr.c. */ + IFUNC_IMPL (i, name, strstr, + IFUNC_IMPL_ADD (array, i, strstr, + hwcap & PPC_FEATURE_HAS_VSX, + __strstr_power7) + IFUNC_IMPL_ADD (array, i, strstr, 1, + __strstr_ppc)) + + return i; } diff --git a/sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c b/sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c new file mode 100644 index 0000000..0ef9712 --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c @@ -0,0 +1,72 @@ +/* Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <string.h> + +#define STRSTR __strstr_string +#if IS_IN (libc) && defined(SHARED) +# undef libc_hidden_builtin_def +# define libc_hidden_builtin_def(name) \ + __hidden_ver1(__strstr_ppc, __GI_strstr, __strstr_ppc); +#endif + +extern __typeof (strstr) __strstr_string attribute_hidden; +extern __typeof (strstr) __strstr_ppc attribute_hidden; + +extern char *__strchr_power7 (const char *x, int c); + +char * +__strstr_power7 (char *s, char *n) +{ + if (!n[0]) + return s; + while (1) + { + s = __strchr_power7 (s, n[0]); + if (!s) + return NULL; + size_t i; + for (i=1; s[i] == n[i] && n[i];i++); + if (!n[i]) + return s; + if (i > 4) + return __strstr_string (s + 1, n); + s++; + } +} + +char * +__strstr_ppc (const char *s, const char *n) +{ + if (!n[0]) + return (char *) s; + while (1) + { + s = strchr (s, n[0]); + if (!s) + return NULL; + size_t i; + for (i=1; s[i] == n[i] && n[i];i++); + if (!n[i]) + return (char *) s; + if (i > 4) + return __strstr_string (s + 1, n); + s++; + } +} + +#include <string/strstr.c> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strstr.c b/sysdeps/powerpc/powerpc64/multiarch/strstr.c new file mode 100644 index 0000000..2be7646 --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strstr.c @@ -0,0 +1,34 @@ +/* Multiple versions of strstr. PowerPC64 version. + Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +/* Define multiple versions only for definition in libc. */ +#if IS_IN (libc) +# include <string.h> +# include <shlib-compat.h> +# include "init-arch.h" + +extern __typeof (strstr) __strstr_ppc attribute_hidden; +extern __typeof (strstr) __strstr_power7 attribute_hidden; + +/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle + ifunc symbol properly. */ +libc_ifunc (strstr, + (hwcap & PPC_FEATURE_HAS_VSX) + ? __strstr_power7 + : __strstr_ppc); +#endif