Message ID: 20150716195538.GA5140@domone
State: New
On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>> May I proceed with this commit?
>>>
>>> Yes, please commit this for 2.22.
>>
>> For the record I trust IBM to make sure these patches make incremental
>> improvements in performance even if they are not the best possible
>> performance as pointed out by Ondrej Bilka.
>>
> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> all. I did that as test how much we could test IBM to verify patches.
>
> I pointed out that it could have possibly quadratic behaviour which
> still does. So please don't accept unreviewed patches next time.

They showed cases for which the code does go faster, and objectively so
using the microbenchmark, and that's a win for now. Please continue to
work with IBM to remove the quadratic worst case.

Tulio, you will need to work out why you have a quadratic worst case.
It's certainly something we try to avoid. Did you make a particular
decision not to avoid it?

Cheers,
Carlos.
"Carlos O'Donell" <carlos@redhat.com> writes:
> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>> all. I did that as test how much we could test IBM to verify patches.

No. That isn't true. I reviewed this patch myself, all 4 versions of
it. So did Steven.

>> I pointed out that it could have possibly quadratic behaviour which
>> still does. So please don't accept unreviewed patches next time.

Yes, this patch does have a quadratic worst case, as does quicksort.
But both algorithms are clearly an improvement over what was available
before them. Is this algorithm perfect? No. I don't believe such a
thing exists. I believe in iterative improvements.

> They showed cases for which the code does go faster and objectively
> so using the microbenchmark, and that's a win for now. Please continue
> to work with IBM to remove the quadratic worst case.

That's our intention. An improvement to this implementation is already
in our backlog (i.e. we haven't started yet) and I'd be glad if someone
is interested in helping us with this.

> Tulio, you will need to work out why you have a quadratic worst case.
> It's certainly something we try to avoid. Did you make a particular
> decision not to avoid it?

IMHO, the overall improvement brought by this patch is more important
than the worst-case scenario Ondřej is showing. Which doesn't mean
we're neglecting the quadratic worst case.
On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >>>> May I proceed with this commit?
> >>>
> >>> Yes, please commit this for 2.22.
> >>
> >> For the record I trust IBM to make sure these patches make incremental
> >> improvements in performance even if they are not the best possible
> >> performance as pointed out by Ondrej Bilka.
> >>
> > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > all. I did that as test how much we could test IBM to verify patches.
> >
> > I pointed out that it could have possibly quadratic behaviour which
> > still does. So please don't accept unreviewed patches next time.
>
> They showed cases for which the code does go faster and objectively
> so using the microbenchmark, and that's a win for now. Please continue
> to work with IBM to remove the quadratic worst case.
>
> Tulio, you will need to work out why you have a quadratic worst case.
> It's certainly something we try to avoid. Did you make a particular
> decision not to avoid it?

If there's a quadratic worst case newly introduced for 2.22, I'd
consider that a security hole (denial of service) that needs to block
the release of 2.22 until it's fixed (possibly by removing the
implementation in question).
On 07/22/2015 12:12 PM, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>
>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>>>> May I proceed with this commit?
>>>>>
>>>>> Yes, please commit this for 2.22.
>>>>
>>>> For the record I trust IBM to make sure these patches make incremental
>>>> improvements in performance even if they are not the best possible
>>>> performance as pointed out by Ondrej Bilka.
>>>>
>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>>> all. I did that as test how much we could test IBM to verify patches.
>>>
>>> I pointed out that it could have possibly quadratic behaviour which
>>> still does. So please don't accept unreviewed patches next time.
>>
>> They showed cases for which the code does go faster and objectively
>> so using the microbenchmark, and that's a win for now. Please continue
>> to work with IBM to remove the quadratic worst case.
>>
>> Tulio, you will need to work out why you have a quadratic worst case.
>> It's certainly something we try to avoid. Did you make a particular
>> decision not to avoid it?
>
> If there's a quadratic worst case newly introduced for 2.22, I'd consider
> that a security hole (denial of service) that needs to block the release
> of 2.22 until it's fixed (possibly by removing the implementation in
> question).

Joseph,

We have had quadratic worst cases in our string routines for years
without them blocking a release. I agree that it is not ideal for a
release to have such behaviour, but should it block this release?

Florian,

I would like to hear your opinion on whether you might consider this a
security issue, or simply a QoI issue.

Tulio,

Could you please review the possibility of quadratic behaviour and
respond promptly? I don't want this to hold up the release.

Cheers,
Carlos.
On Wed, 2015-07-22 at 16:12 +0000, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>
> > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > >>>> May I proceed with this commit?
> > >>>
> > >>> Yes, please commit this for 2.22.
> > >>
> > >> For the record I trust IBM to make sure these patches make incremental
> > >> improvements in performance even if they are not the best possible
> > >> performance as pointed out by Ondrej Bilka.
> > >>
> > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > all. I did that as test how much we could test IBM to verify patches.
> > >
> > > I pointed out that it could have possibly quadratic behaviour which
> > > still does. So please don't accept unreviewed patches next time.
> >
> > They showed cases for which the code does go faster and objectively
> > so using the microbenchmark, and that's a win for now. Please continue
> > to work with IBM to remove the quadratic worst case.
> >
> > Tulio, you will need to work out why you have a quadratic worst case.
> > It's certainly something we try to avoid. Did you make a particular
> > decision not to avoid it?
>
> If there's a quadratic worst case newly introduced for 2.22, I'd consider
> that a security hole (denial of service) that needs to block the release
> of 2.22 until it's fixed (possibly by removing the implementation in
> question).

There is a denial of service attack here, but not the one you would
think. Effectively this (noise and FUD) is blocking my team from
submitting patches and preventing my team from moving on to additional
optimization. The core problem is that there is no objective definition
of what quadratic behavior might be, nor has a certain troll who
constantly shouts (quadratic! quadratic! quadratic!) about this
provided objective proof, in the form of a benchmark.

So if you really believe that this is a problem, you should insist on
an objective benchmark and apply that objective standard to every
platform. This would allow my team to work on the problem and improve
our code even more. Otherwise you are just wasting everyone's time.
On 07/22/2015 07:55 PM, Carlos O'Donell wrote:
> I would like to hear your opinion on whether you might consider this a
> security issue, or simply a QoI issue.

Not a security issue as far as I can see.
"Carlos O'Donell" <carlos@redhat.com> writes:
> On 07/22/2015 12:12 PM, Joseph Myers wrote:
>> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>>
>>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>>>>> May I proceed with this commit?
>>>>>>
>>>>>> Yes, please commit this for 2.22.
>>>>>
>>>>> For the record I trust IBM to make sure these patches make incremental
>>>>> improvements in performance even if they are not the best possible
>>>>> performance as pointed out by Ondrej Bilka.
>>>>>
>>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>>>> all. I did that as test how much we could test IBM to verify patches.
>>>>
>>>> I pointed out that it could have possibly quadratic behaviour which
>>>> still does. So please don't accept unreviewed patches next time.
>>>
>>> They showed cases for which the code does go faster and objectively
>>> so using the microbenchmark, and that's a win for now. Please continue
>>> to work with IBM to remove the quadratic worst case.
>>>
>>> Tulio, you will need to work out why you have a quadratic worst case.
>>> It's certainly something we try to avoid. Did you make a particular
>>> decision not to avoid it?
>>
>> If there's a quadratic worst case newly introduced for 2.22, I'd consider
>> that a security hole (denial of service) that needs to block the release
>> of 2.22 until it's fixed (possibly by removing the implementation in
>> question).

Could you elaborate why a quadratic worst case in a string function can
be considered a denial of service, please?

> Tulio,
>
> Could you please review the possibility of quadratic behaviour and respond
> promptly? I don't want this to hold up the release.

I confirm this algorithm does have a quadratic worst case, which
appears when the needle length is <= 2048. If the needle is longer than
2048, it falls back to the previous implementation.
On 07/22/2015 02:29 PM, Tulio Magno Quites Machado Filho wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
>
>> On 07/22/2015 12:12 PM, Joseph Myers wrote:
>>> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>>>
>>>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>>>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>>>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>>>>>> May I proceed with this commit?
>>>>>>>
>>>>>>> Yes, please commit this for 2.22.
>>>>>>
>>>>>> For the record I trust IBM to make sure these patches make incremental
>>>>>> improvements in performance even if they are not the best possible
>>>>>> performance as pointed out by Ondrej Bilka.
>>>>>>
>>>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>>>>> all. I did that as test how much we could test IBM to verify patches.
>>>>>
>>>>> I pointed out that it could have possibly quadratic behaviour which
>>>>> still does. So please don't accept unreviewed patches next time.
>>>>
>>>> They showed cases for which the code does go faster and objectively
>>>> so using the microbenchmark, and that's a win for now. Please continue
>>>> to work with IBM to remove the quadratic worst case.
>>>>
>>>> Tulio, you will need to work out why you have a quadratic worst case.
>>>> It's certainly something we try to avoid. Did you make a particular
>>>> decision not to avoid it?
>>>
>>> If there's a quadratic worst case newly introduced for 2.22, I'd consider
>>> that a security hole (denial of service) that needs to block the release
>>> of 2.22 until it's fixed (possibly by removing the implementation in
>>> question).
>
> Could you elaborate why a quadratic worst case in a string function can be
> considered a denial of service, please?
>
>> Tulio,
>>
>> Could you please review the possibility of quadratic behaviour and respond
>> promptly? I don't want this to hold up the release.
>
> I confirm this algorithm does have a quadratic worst case, which appears if
> needle <= 2048.
> If the needle > 2048, it falls back to the previous implementation.

While we have striven to remove quadratic worst cases from the string
ops, there has been no policy against them when the machine maintainer,
knowing their machine best, thinks a particular algorithm with such
behaviour works best. It seems we need consensus on the requirements
for string operations, but reaching such consensus will take long
enough that 2.22 will have been released by then. I am OK with leaving
the patches in place and resolving the discussion of quadratic
behaviour and algorithm acceptability for 2.23.

Cheers,
Carlos.
On Wed, 2015-07-22 at 15:29 -0300, Tulio Magno Quites Machado Filho wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
>
> > On 07/22/2015 12:12 PM, Joseph Myers wrote:
> >> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> >>
> >>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> >>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> >>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >>>>>>> May I proceed with this commit?
> >>>>>>
> >>>>>> Yes, please commit this for 2.22.
> >>>>>
> >>>>> For the record I trust IBM to make sure these patches make incremental
> >>>>> improvements in performance even if they are not the best possible
> >>>>> performance as pointed out by Ondrej Bilka.
> >>>>>
> >>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> >>>> all. I did that as test how much we could test IBM to verify patches.
> >>>>
> >>>> I pointed out that it could have possibly quadratic behaviour which
> >>>> still does. So please don't accept unreviewed patches next time.
> >>>
> >>> They showed cases for which the code does go faster and objectively
> >>> so using the microbenchmark, and that's a win for now. Please continue
> >>> to work with IBM to remove the quadratic worst case.
> >>>
> >>> Tulio, you will need to work out why you have a quadratic worst case.
> >>> It's certainly something we try to avoid. Did you make a particular
> >>> decision not to avoid it?
> >>
> >> If there's a quadratic worst case newly introduced for 2.22, I'd consider
> >> that a security hole (denial of service) that needs to block the release
> >> of 2.22 until it's fixed (possibly by removing the implementation in
> >> question).
>
> Could you elaborate why a quadratic worst case in a string function can be
> considered a denial of service, please?
>
> > Tulio,
> >
> > Could you please review the possibility of quadratic behaviour and respond
> > promptly? I don't want this to hold up the release.
>
> I confirm this algorithm does have a quadratic worst case, which appears if
> needle <= 2048.
> If the needle > 2048, it falls back to the previous implementation.

Well, it is somewhat worse, but whether this meets the definition of
denial of service is the debate. Also, we can reduce it by changing the
crossover point for needle size, but at some loss of overall
performance for legitimate cases. This gets back to my request for an
objective standard. No one knows.
On Wed, 22 Jul 2015, Carlos O'Donell wrote:
> > If there's a quadratic worst case newly introduced for 2.22, I'd consider
> > that a security hole (denial of service) that needs to block the release
> > of 2.22 until it's fixed (possibly by removing the implementation in
> > question).
>
> Joseph,
>
> We have had quadratic worst cases in our string routines for years
> without them blocking a release. I agree that it is not ideal for a release

And I believe we established a consensus, when removing the SSE4
implementation (bug 12100), that such implementations are not suitable
for inclusion in glibc.

> to have such behaviour, but should it block this release?

As a denial-of-service regression (for any code that may use strstr on
untrusted inputs), yes.
On Wed, 22 Jul 2015, Tulio Magno Quites Machado Filho wrote:
> >> If there's a quadratic worst case newly introduced for 2.22, I'd consider
> >> that a security hole (denial of service) that needs to block the release
> >> of 2.22 until it's fixed (possibly by removing the implementation in
> >> question).
>
> Could you elaborate why a quadratic worst case in a string function can be
> considered a denial of service, please?

Because code may use strstr on untrusted inputs, meaning small inputs
(e.g. MB in size) cause large (effectively infinite) CPU usage; any
case where an attacker can cause your resource usage to grow more than
linearly with their usage is a denial-of-service problem. Thus gnulib
considers such strstr implementations buggy and disables them (so
preventing any of your optimizations from being effective for any
program using gnulib), but not everything uses gnulib.

(There are cases, e.g. untrusted regular expressions, where glibc can't
reasonably avoid such resource usage, and so anyone using untrusted
regular expressions must apply resource limits externally. But strstr
isn't such a case.)

> > Tulio,
> >
> > Could you please review the possibility of quadratic behaviour and respond
> > promptly? I don't want this to hold up the release.
>
> I confirm this algorithm does have a quadratic worst case, which appears if
> needle <= 2048.
> If the needle > 2048, it falls back to the previous implementation.

strstr should be O(m+n) (for needle length m, haystack length n); a
naive implementation could be O(mn). Are you saying that, for constant
m <= 2048, it's O(n^2)? Because O(mn) for bounded m is fine (the
threshold 2048 would preferably have been determined by some sort of
benchmark, but the precise choice of threshold isn't something to block
a release).
On Wed, 2015-07-22 at 19:41 +0000, Joseph Myers wrote:
> On Wed, 22 Jul 2015, Tulio Magno Quites Machado Filho wrote:
>
> > >> If there's a quadratic worst case newly introduced for 2.22, I'd consider
> > >> that a security hole (denial of service) that needs to block the release
> > >> of 2.22 until it's fixed (possibly by removing the implementation in
> > >> question).
> >
> > Could you elaborate why a quadratic worst case in a string function can be
> > considered a denial of service, please?
>
> Because code may use strstr on untrusted inputs, meaning small inputs
> (e.g. MB in size) cause large (effectively infinite) CPU usage; any case
> where an attacker can cause your resource usage to grow more than linearly
> with their usage is a denial-of-service problem. Thus gnulib considers
> such strstr implementations buggy and disables them (so preventing any of
> your optimizations from being effective for any program using gnulib), but
> not everything uses gnulib.
>
> (There are cases, e.g. untrusted regular expressions, where glibc can't
> reasonably avoid such resource usage, and so anyone using untrusted
> regular expressions must apply resource limits externally. But strstr
> isn't such a case.)
>
> > > Tulio,
> > >
> > > Could you please review the possibility of quadratic behaviour and respond
> > > promptly? I don't want this to hold up the release.
> >
> > I confirm this algorithm does have a quadratic worst case, which appears if
> > needle <= 2048.
> > If the needle > 2048, it falls back to the previous implementation.
>
> strstr should be O(m+n) (for needle length m, haystack length n); a naive
> implementation could be O(mn). Are you saying that, for constant m <=
> 2048, it's O(n^2)? Because O(mn) for bounded m is fine (the threshold
> 2048 would preferably have been determined by some sort of benchmark, but
> the precise choice of threshold isn't something to block a release).

This is not an objective standard, because real code in the real world
does not behave in a mathematically precise way. What is good enough
and what is not good enough has to be measured and measurable. So if
the big needle is 4x slower than the small needle, is that a denial of
service? I don't think so! What about 8x, or 10x? Maybe? But I still
don't think so, because the examples given are extremely unlikely in
the real world, and in a deliberately bad case you may just see a
slightly higher peak load. On a server with hundreds or thousands of
processors, who would notice? 100x, well, that might be a problem. So
what is your number? Because quadratic and O(this) and O(that) are
meaningless in this context.
On Thu, Jul 16, 2015 at 04:16:12PM -0400, Carlos O'Donell wrote:
> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >>>> May I proceed with this commit?
> >>>
> >>> Yes, please commit this for 2.22.
> >>
> >> For the record I trust IBM to make sure these patches make incremental
> >> improvements in performance even if they are not the best possible
> >> performance as pointed out by Ondrej Bilka.
> >>
> > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > all. I did that as test how much we could test IBM to verify patches.
> >
> > I pointed out that it could have possibly quadratic behaviour which
> > still does. So please don't accept unreviewed patches next time.
>
> They showed cases for which the code does go faster and objectively
> so using the microbenchmark, and that's a win for now. Please continue
> to work with IBM to remove the quadratic worst case.

Carlos, that benchmark lacks consensus. When I wrote it, it was
explicitly mentioned in the thread as a counterexample of why it
shouldn't be used. Florian raised the objection that strstr isn't used
on random inputs. Also, I don't think it is an 'objective' benchmark at
all; for example, it never measures the case where we find the string,
which could happen relatively often when the user knows the string is
there.
On Wed, Jul 22, 2015 at 04:12:47PM +0000, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>
> > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > >>>> May I proceed with this commit?
> > >>>
> > >>> Yes, please commit this for 2.22.
> > >>
> > >> For the record I trust IBM to make sure these patches make incremental
> > >> improvements in performance even if they are not the best possible
> > >> performance as pointed out by Ondrej Bilka.
> > >>
> > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > all. I did that as test how much we could test IBM to verify patches.
> > >
> > > I pointed out that it could have possibly quadratic behaviour which
> > > still does. So please don't accept unreviewed patches next time.
> >
> > They showed cases for which the code does go faster and objectively
> > so using the microbenchmark, and that's a win for now. Please continue
> > to work with IBM to remove the quadratic worst case.
> >
> > Tulio, you will need to work out why you have a quadratic worst case.
> > It's certainly something we try to avoid. Did you make a particular
> > decision not to avoid it?
>
> If there's a quadratic worst case newly introduced for 2.22, I'd consider
> that a security hole (denial of service) that needs to block the release
> of 2.22 until it's fixed (possibly by removing the implementation in
> question).

No, my objection was that this should be reverted as a patch that
wasn't reviewed. My main reasons for believing that are the numerous
issues that any competent reviewer should notice. The first red flag
was Tulio saying it was OK by him, then saying that he received a
message that there is a bug in strstr. That doesn't inspire much
confidence.

First, the quadratic behaviour. I knew that it's bounded by 2048, but
as Tulio said that no work had been done on it, I was silent.
As he said that there was no work on it, I realized that he either
didn't read the patch or forgot that he reviewed the part which calls
strlen and strnlen and jumps to the generic strstr.

You and I read the code and find the 2048 constant weird. As a reviewer
I asked to clarify it, but the author just sent unrelated benchmarks,
never one that would convince us. On the benchmark that I previously
sent in the thread I showed that it's 100 times slower for that needle,
as I claimed. If Tulio were familiar with the code he would suggest
decreasing that constant to 32.

As for the quadratic behaviour, that is mostly shorthand, as I don't
want to repeat myself. It's beyond my understanding how one can believe
that in practice an O(mn) algorithm with m=2048 could beat an O(n) one.
I raised that for all versions of strstr/strcasestr.

That's just the simple issue; I raised it months ago, so they had
enough time.

Then, on the matter of performance in the general case, I found several
obvious problems, so why didn't Tulio? First, strlen/strnlen is called
unnecessarily when the first strchr call returns null. That's the hot
path. Second, the crosspage check is unnecessary, as you could search
for the ending character of the needle.

These are the arguments for why I don't think it was reviewed.

Joseph, what is the policy about assembly implementations when there is
a better generic implementation? This assembly is slower than a simple
optimization of the C strstr, with my patch. I repeatedly asked for it
to be checked but never got a reply.

[PATCH] Improve generic strstr performance.

I submitted a C implementation that beats the assembly strstr
previously in the thread.

To answer the objection that the benchmarks are inaccurate because
strings in practice are not random, I used dryrun as I said earlier. I
recorded strstr calls, mainly in the gdb program.

I saved the profile trace here for replicability.
http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2

With the assembly I get the following:

[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
replaying gdb took 26567826
took 26699020
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
replaying gdb took 26608298
took 26698638
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
replaying gdb took 26752268
took 26876982

replaying bash took 25841770

But when I replace that with the generic improvement, it improves
performance in practice:

[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
replaying gdb took 25096738
took 25187742
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
replaying gdb took 25182840
took 25255354
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
replaying gdb took 25060592
took 25165898

The main reason, as I explained before, is that the patch goes to
considerable lengths to optimize the cold path while not optimizing the
hot one. In a random benchmark, the probability that you match k
characters from the needle decreases exponentially (similar behaviour
is observed in practice).

For needles of less than 8 bytes this patch does roughly the following;
larger needles use more complicated code that checks 8 bytes of the
needle at a time:

  nsize = strlen (needle);
  if (strnlen (haystack, nsize) < nsize)
    return NULL;

  if (nsize <= 9)
    {
      needle_start = *(uint64_t *) (needle + 1);
      mask = (~0UL) >> (8 * (9 - nsize));
      while (1)
        {
          haystack = strchr (haystack, needle[0]);
          if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
            return haystack;
          haystack++;
        }
    }

The problem is that optimizations like

  if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
    return haystack;

do more harm than good. Using a simple byte check is faster in
practice, as the mismatch probably occurs at the second character and
the overhead of testing more characters is bigger than the branch
misprediction penalty.
When I ran the profile trace, the average numbers of digraphs and
trigraphs were as follows:

  digraph count: 0.0126
  trigraph count: 0.0024

On the other hand, the patch does nothing about the hot strchr calls.
If performance is a concern, these should at least be inlined, or
iterate over a mask of first-character occurrences. I am only telling
from my experience which optimizations will help. The optimizations in
this patch will not.
On Sat, 2015-07-25 at 10:20 +0200, Ondřej Bílka wrote:
> On Wed, Jul 22, 2015 at 04:12:47PM +0000, Joseph Myers wrote:
> > On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> >
> > > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > > >>>> May I proceed with this commit?
> > > >>>
> > > >>> Yes, please commit this for 2.22.
> > > >>
> > > >> For the record I trust IBM to make sure these patches make incremental
> > > >> improvements in performance even if they are not the best possible
> > > >> performance as pointed out by Ondrej Bilka.
> > > >>
> > > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > > all. I did that as test how much we could test IBM to verify patches.
> > > >
> > > > I pointed out that it could have possibly quadratic behaviour which
> > > > still does. So please don't accept unreviewed patches next time.
> > >
> > > They showed cases for which the code does go faster and objectively
> > > so using the microbenchmark, and that's a win for now. Please continue
> > > to work with IBM to remove the quadratic worst case.
> > >
> > > Tulio, you will need to work out why you have a quadratic worst case.
> > > It's certainly something we try to avoid. Did you make a particular
> > > decision not to avoid it?
> >
> > If there's a quadratic worst case newly introduced for 2.22, I'd consider
> > that a security hole (denial of service) that needs to block the release
> > of 2.22 until it's fixed (possibly by removing the implementation in
> > question).
>
> No, my objection was that this should be reverted as a patch that wasn't
> reviewed. My main reasons for believing that are the numerous issues that
> any competent reviewer should notice. The first red flag was Tulio saying
> it was OK by him, then saying that he received a message that there is a
> bug in strstr.

This is ridiculous.
Most of this personal and unprofessional attack below is based on
incorrect assumptions (about things you do not actually know) or taken
out of context and twisted to fit your narrative. You really need to
take a deep breath and think before you say anything else.

> That doesn't inspire much confidence.
>
> First, the quadratic behaviour. I knew that it's bounded by 2048, but
> as Tulio said that no work had been done on it, I was silent. As he
> said that there was no work on it, I realized that he either didn't
> read the patch or forgot that he reviewed the part which calls strlen
> and strnlen and jumps to the generic strstr.
>
> You and I read the code and find the 2048 constant weird. As a reviewer
> I asked to clarify it, but the author just sent unrelated benchmarks,
> never one that would convince us. On the benchmark that I previously
> sent in the thread I showed that it's 100 times slower for that needle,
> as I claimed. If Tulio were familiar with the code he would suggest
> decreasing that constant to 32.
>
> As for the quadratic behaviour, that is mostly shorthand, as I don't
> want to repeat myself. It's beyond my understanding how one can believe
> that in practice an O(mn) algorithm with m=2048 could beat an O(n) one.
> I raised that for all versions of strstr/strcasestr.
>
> That's just the simple issue; I raised it months ago, so they had
> enough time.
>
> Then, on the matter of performance in the general case, I found several
> obvious problems, so why didn't Tulio? First, strlen/strnlen is called
> unnecessarily when the first strchr call returns null. That's the hot
> path. Second, the crosspage check is unnecessary, as you could search
> for the ending character of the needle.
>
> These are the arguments for why I don't think it was reviewed.
>
> Joseph, what is the policy about assembly implementations when there is
> a better generic implementation? This assembly is slower than a simple
> optimization of the C strstr, with my patch. I repeatedly asked for it
> to be checked but never got a reply.
>
> [PATCH] Improve generic strstr performance.
> 
> I submitted a C implementation that beats the assembly strstr earlier in
> this thread.
> 
> To answer objections that benchmarks are inaccurate because strings in
> practice are not random, I used dryrun as I said earlier. I recorded
> strstr calls, mainly in the gdb program.
> 
> I saved the profile trace here for replicability:
> http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2
> 
> With the assembly I get the following:
> 
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> replaying gdb took 26567826
> took 26699020
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> replaying gdb took 26608298
> took 26698638
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> replaying gdb took 26752268
> took 26876982
> 
> replaying bash took 25841770
> 
> But when I replace it with the generic improvement, performance improves
> in practice:
> 
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> replaying gdb took 25096738
> took 25187742
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> replaying gdb took 25182840
> took 25255354
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> replaying gdb took 25060592
> took 25165898
> 
> The main reason, as I explained before, is that it goes to considerable
> lengths to optimize the cold path while not optimizing the hot one.
> In a random benchmark, the probability that you match k characters of the
> needle decreases exponentially. (Similar behaviour is observed in
> practice.)
> 
> For needles of less than 8 bytes this patch does roughly the following;
> larger needles have more complicated code to check 8 bytes of the needle
> at a time.
> 
>   nsize = strlen (needle);
>   if (strnlen (haystack, nsize) < nsize)
>     return NULL;
> 
>   if (nsize <= 9)
>     {
>       needle_start = *((uint64_t *) needle + 1);
>       mask = (~0) >> (9 - nsize);
>       while (1)
>         {
>           haystack = strchr (haystack, needle[0]);
>           if ((*((uint64_t *) haystack) & mask) == needle_start)
>             return haystack;
>           haystack++;
>         }
>     }
> 
> The problem is that optimizations like
> 
>   if ((*((uint64_t *) (haystack + 1)) & mask) == needle_start)
>     return haystack;
> 
> do more harm than good. A simple byte check is faster in practice, as
> the mismatch probably occurs at the second character, and the overhead of
> testing more characters is bigger than the branch misprediction penalty.
> When I ran the profile trace, the average numbers of digraphs and
> trigraphs were the following:
> 
> digraph count: 0.0126 trigraph count: 0.0024
> 
> On the other hand, it does nothing with the hot strchr calls. If
> performance is a concern, these should at least be inlined, or iterate
> over a first-character mask.
> 
> I am only telling from my experience which optimizations will help.
> The optimizations in this patch will not.
On Sun, Jul 26, 2015 at 08:26:06PM -0500, Steven Munroe wrote:
> On Sat, 2015-07-25 at 10:20 +0200, Ondřej Bílka wrote:
> > No, my objection was that this should be reverted as a patch that wasn't
> > reviewed. My main reason to believe that is the numerous issues that any
> > competent reviewer should notice. The first red flag was Tulio telling me
> > it was OK, then telling me that he had received a message that there is a
> > bug in strstr.
> 
> This is ridiculous. Most of this personal and unprofessional attack
> below is based on incorrect assumptions (of things you do not actually
> know) or taken out of context and twisted to fit your narrative.
> 
> You really need to take a deep breath and think before you say anything
> else.
> 
Could you stop making ad hominem attacks, please? I have noticed that you
resort to these when I show that the facts say otherwise.

As for this being out of context, I expected that you would of course deny
everything and say something like that. So could you answer the following
three questions?

1. Reviews: it would be better to always write reviews publicly. But as
you still have them, could you send the reviews here to clarify?

2. Quadratic behaviour: it should be clear to any expert that with a
2047-byte needle you need to check all these bytes for each byte in the
haystack. If you optimistically check 8 bytes per cycle, then it would
take 256 cycles times the haystack size. In practice it's slower, as it's
equivalent to strcmp, which on power7 has the following timing:
Length 1024, alignment 14/ 7: 593.5 651.016 318.844 447.203

So why didn't you insist on fixing that by changing the constant to
something reasonable?

3. Why, despite your frequent reviews, is the implementation in the
average case still slower than a simple bugfix of the generic strstr?

As for my making incorrect assumptions and taking things out of context,
it's quite easy to make vague accusations like this. So be more specific:
what assumptions, and where is your evidence that an assumption is
incorrect?

> 
> > That doesn't inspire much confidence.
> > [rest of the quoted message, repeated in full above, trimmed]
On Mon, 2015-07-27 at 11:19 +0200, Ondřej Bílka wrote:
> On Sun, Jul 26, 2015 at 08:26:06PM -0500, Steven Munroe wrote:
> > On Sat, 2015-07-25 at 10:20 +0200, Ondřej Bílka wrote:
> > > No, my objection was that this should be reverted as a patch that wasn't
> > > reviewed. My main reason to believe that is the numerous issues that any
> > > competent reviewer should notice. The first red flag was Tulio telling me
> > > it was OK, then telling me that he had received a message that there is a
> > > bug in strstr.
> > 
> > This is ridiculous. Most of this personal and unprofessional attack
> > below is based on incorrect assumptions (of things you do not actually
> > know) or taken out of context and twisted to fit your narrative.
> > 
> > You really need to take a deep breath and think before you say anything
> > else.
> 
> Could you stop making ad hominem attacks, please? I have noticed that you
> resort to these when I show that the facts say otherwise.
> 
> As for this being out of context, I expected that you would of course deny
> everything and say something like that. So could you answer the following
> three questions?
> 
> 1. Reviews: it would be better to always write reviews publicly. But as
> you still have them, could you send the reviews here to clarify?
> 
> 2. Quadratic behaviour: it should be clear to any expert that with a
> 2047-byte needle you need to check all these bytes for each byte in the
> haystack. If you optimistically check 8 bytes per cycle, then it would
> take 256 cycles times the haystack size. In practice it's slower, as it's
> equivalent to strcmp, which on power7 has the following timing:
> Length 1024, alignment 14/ 7: 593.5 651.016 318.844 447.203
> 
> So why didn't you insist on fixing that by changing the constant to
> something reasonable?
> 
> 3. Why, despite your frequent reviews, is the implementation in the
> average case still slower than a simple bugfix of the generic strstr?
> 
> As for my making incorrect assumptions and taking things out of context,
> it's quite easy to make vague accusations like this.
> So be more specific: what assumptions, and where is your evidence that
> an assumption is incorrect?
> 

What we have here is a complete failure to communicate.

On June 9th I suggested that if you felt so strongly about this, for
reasons that I did not understand, then you should provide an objective
benchmark that proved your point.
https://sourceware.org/ml/libc-alpha/2015-06/msg00365.html

On June 10th Roland agreed that that was a reasonable way to proceed.
https://sourceware.org/ml/libc-alpha/2015-06/msg00367.html

You rejected this. So Raji continued to use the one strstr benchmark we
have. We tried to compromise by adjusting the crossover point between the
new and old algorithm to avoid the specific case you gave. And you
rejected that.

On July 16th Raji submitted the V3 of the patch:
https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html

On July 16th you basically told Carlos that my team was incompetent and
could not be trusted. Hmm, does this qualify as an ad hominem attack?
https://sourceware.org/ml/libc-apha/2015-07/msg00501.html

At which point I stopped listening. There does not seem to be anything
that I can say or do to satisfy you. So I stopped trying.

On July 23rd Carlos asked for a second opinion on whether what we had was
in fact quadratic:
https://sourceware.org/ml/libc-alpha/2015-07/msg00799.html
The consensus was no; it is at worst O(nm).

This has wasted huge amounts of time. Time we could have used to improve
the code for the next release.

I will make you a promise. If you will provide the objective benchmark
that illustrates this issue, and get the community to accept it as an
objective measure of a realistic scenario, I will ask the team to work on
appropriate improvements.
On Sat, 25 Jul 2015, Ondřej Bílka wrote:

> Joseph, what is the policy about assembly implementations when there is
> a better generic implementation? This assembly is slower than a

Removal of such an assembly implementation is a simple matter of getting
consensus, typically based on benchmark results from the checked-in
benchtests (so if necessary, you may need to get consensus on an addition
to such benchtests first). The most likely form of such consensus is
through agreement from the architecture maintainer, but I don't expect
such changes, properly supported by benchmark results from the checked-in
benchtests, to be controversial. Of course the comparison is with the
*checked-in* generic implementation, not something posted but not checked
in.

> simple optimization of the generic C strstr, for which I have a patch.
> I repeatedly asked for it to be checked but never got a reply.
> 
> [PATCH] Improve generic strstr performance.

Well, if a patch takes a while to get reviewed, keep pinging, while making
sure that the patch submission includes all necessary information, that
the required benchmarks have been reviewed and checked in first, etc.

You have about 80 unreviewed patches showing in the patchwork list. If
any of those are superseded by later versions of the same patch, I
strongly advise marking the earlier versions as superseded in patchwork so
that anyone using patchwork to find patches to review finds those patches
of yours that actually need review and not a load of noise from older
superseded patch versions. (The same applies to anyone submitting patches
- keep the list of your patches pending review clean - but your list is
particularly long.)
On 07/27/2015 05:19 AM, Ondřej Bílka wrote:
> 1. Reviews: it would be better to always write reviews publicly. But as
> you still have them, could you send the reviews here to clarify?

As machine maintainers they can commit their code as if they had
consensus. Your after-the-fact reviews are good, and should be considered
follow-on work.

You should work closely with IBM in a professional and technical manner,
present your own patches against the existing IBM code, and discuss the
testing and changes that you made. All of this *on top* of whatever is in
the existing master branch.

You must present this work in a clear and concise manner, providing IBM
the tools with which to evaluate and test your implementation.

The IBM maintainers do not need to convince you. They have consensus as
maintainers. You need to convince *them* that your solution is better,
rather than attempting to block their patches, which is not your
responsibility.

Lastly, beware that your single dissenting opinion may not constitute an
important part of the concerned interests of the glibc community[1].
Therefore, even if you are correct, the community may tell you that your
comments will not be considered until you find a way to work with the
machine maintainer.

I for one would like to see you working *with* IBM instead of what
appears to be an antagonistic relationship surrounding these
performance-related changes. I think both sides should look to the
positive aspects of having more people reviewing existing implementations
for performance benefits.

Cheers,
Carlos.

[1] https://sourceware.org/glibc/wiki/Consensus
diff --git a/benchtests/bench-strstr.c b/benchtests/bench-strstr.c
index 74f3ee8..b758969 100644
--- a/benchtests/bench-strstr.c
+++ b/benchtests/bench-strstr.c
@@ -128,6 +128,23 @@ test_main (void)
       printf ("\t%s", impl->name);
   putchar ('\n');
 
+  char s1[1000000],s2[2000];
+
+  memset (s1, 'a', 1000000);
+  memset (s2, 'a', 2000);
+  s1[999999] = '\0';
+  s2[1998] = 'b';
+  s2[1999] = '\0';
+
+  {
+    printf ("strstr(\"aaa...aaa\",\"aaa...aaab\")\n");
+    FOR_EACH_IMPL (impl, 0)
+      do_one_test (impl, s1, s2, NULL);
+    putchar ('\n');
+  }
+
   for (size_t klen = 2; klen < 32; ++klen)
     for (size_t hlen = 2 * klen; hlen < 16 * klen; hlen += klen)
       {