Message ID: DB5PR08MB1030E559B3818F52CA48A63583F30@DB5PR08MB1030.eurprd08.prod.outlook.com
State:      New
Series:     Use Quicksearch in strstr
On Mon, 29 Oct 2018, Wilco Dijkstra wrote:

> The needle length is limited to 254 - combined with a hashed shift table
> this reduces the shift table memory 16 to 32 times, lowering preprocessing
> overhead and minimizing cache effects.

This is a bit unclear: given the needle length limit, an unhashed table would
require 256 bytes; hashing as implemented in the patch reduces it to 64 bytes.
I think the unhashed variant should be even faster - is saving 192 bytes on
the stack worth it?

> The performance gain using the improved bench-strstr on Cortex-A72 is 2.1x.
> Due to the low overhead the gain is larger on small haystacks: haystacks of 64
> chars are ~3 times faster for needles of 5-16 chars.

What does the worst-case change look like (I think it's searching for
"a{251}ba" in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense
to move the needle length cutoff further below 254?

> 	* string/str-two-way.h (two_way_short_needle): Add inline to avoid warning.
> 	* string/strstr.c (strstr2): Add new function.
> 	(strstr3): Likewise.
> 	(STRSTR): Completely rewrite strstr to improve performance.

Generating the patch with 'git diff --patience' may produce a more readable
diff.

> +  /* Use the Quick-Search algorithm for needle lengths less than 255.  */
> +  if (__glibc_likely (ne_len < 255))
> +    {
> +      uint8_t shift[1 << SHIFT_TABLE_BITS];
> +      const unsigned char *end = hs + hs_len - ne_len;
> +
> +      /* Initialize bad character shift hash table.  */
> +      memset (shift, ne_len + 1, sizeof (shift));
> +      for (int i = 0; i < ne_len; i++)
> +	shift[ne[i] % sizeof (shift)] = ne_len - i;
> +
> +      do
> +	{
> +	  hs--;

Doesn't this move hs out of bounds and thus invoke undefined behavior?

Thanks.
Alexander
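[Editorial note: the quoted fragment can be fleshed out into a minimal, self-contained Quick-Search sketch (hypothetical helper name; this is not the actual patch). It structures the loop so the haystack pointer is never decremented before the start of the buffer, side-stepping the out-of-bounds concern raised above, and it assumes a NUL-terminated haystack so that reading the byte just past the window is always in bounds.]

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SHIFT_TABLE_BITS 6  /* 64-entry hashed shift table, as in the patch */

/* Minimal Sunday Quick-Search over a NUL-terminated haystack of length
   hs_len.  Requires 0 < ne_len < 255 and ne_len <= hs_len.  Returns a
   pointer to the first match, or NULL.  */
static const unsigned char *
quicksearch (const unsigned char *hs, size_t hs_len,
             const unsigned char *ne, size_t ne_len)
{
  uint8_t shift[1 << SHIFT_TABLE_BITS];
  const unsigned char *end = hs + hs_len - ne_len;

  /* Bad-character shift: distance from the last occurrence of each
     (hashed) character to one past the end of the needle.  */
  memset (shift, ne_len + 1, sizeof (shift));
  for (size_t i = 0; i < ne_len; i++)
    shift[ne[i] % sizeof (shift)] = ne_len - i;

  while (hs <= end)
    {
      if (memcmp (hs, ne, ne_len) == 0)
        return hs;
      /* hs[ne_len] is at worst the NUL terminator, so this read stays
         in bounds without ever moving hs backwards.  */
      hs += shift[hs[ne_len] % sizeof (shift)];
    }
  return NULL;
}
```

Hash collisions in the shift table can only shorten a shift (the table keeps the smallest safe skip among colliding characters), never lengthen it, so correctness is unaffected; only the skip distance degrades.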
On Mon, Oct 29, 2018 at 03:55:36PM +0000, Wilco Dijkstra wrote:
> This patch significantly improves performance of strstr by using Sunday's
> Quick-Search algorithm. Due to its simplicity it has the best average
> performance of string matching algorithms on almost all inputs. It uses a
> bad-character shift table to skip past mismatches.
>
> The needle length is limited to 254 - combined with a hashed shift table
> this reduces the shift table memory 16 to 32 times, lowering preprocessing
> overhead and minimizing cache effects. The needle limit also implies
> worst-case performance remains linear.
>
> Small 1-4 byte needles use special case code which is typically faster.
> Very long needles continue to use the linear-time Two-Way algorithm.
>
> The performance gain using the improved bench-strstr on Cortex-A72 is 2.1x.
> Due to the low overhead the gain is larger on small haystacks: haystacks of 64
> chars are ~3 times faster for needles of 5-16 chars.

The real benefit is fixing the mistake of calculating the needle size before
calling strchr - the code after that first strchr is a cold path.  Better
would be to call strchr again, and if it still doesn't match the third time,
the code after that shouldn't matter as long as it's linear.  If you capture
inputs of actual applications you will find what the bottlenecks are.

That "benchmark" is complete garbage.  In practice running time is dominated
by branch misprediction etc., but the "benchmark" calls the function many
times on the same input, leading to no branch mispredictions.

You couldn't do much with the needle as real inputs are really small;
precomputation is just too slow - even the strlen of the needle would be a
bottleneck.  Also strcmp, when you take branch misprediction into account,
is too slow: I tried a lot of things for that comparison and nothing could
beat doing it byte-by-byte, unrolled 4 times or so.
As for real inputs, record something with this:
http://kam.mff.cuni.cz/~ondra/dryrun.tar.bz2

  make
  LD_PRELOAD=./dryrun_strstr.so bash
  # do something, exit bash
  ./summary_strstr

which prints statistics like the following, so you can see real data:

  average size 21.28  needle size 6.67  comparisons 23.00
  digraphs 0.22  trigraphs 0.00
  calls 1907  succeed 61.4%  latencies 0.0 0.0
  s1 aligned to 4 bytes 81.9%  aligned to 8 bytes 81.6%  aligned to 16 bytes 31.7%
  s2 aligned to 4 bytes 67.0%  aligned to 8 bytes  8.0%  aligned to 16 bytes  7.9%
  needle found in n bytes:
    n <= 0:  7.9%   n <= 1:  7.9%   n <= 2: 11.9%   n <= 3: 19.7%   n <= 4: 19.7%
    n <= 8: 19.7%   n <= 16: 65.3%  n <= 32: 81.1%  n <= 64: 96.8%
  needle size:
    n <= 0:  7.9%   n <= 1:  7.9%   n <= 2:  7.9%   n <= 3:  7.9%   n <= 4: 37.8%
    n <= 8: 96.9%   n <= 16: 96.9%  n <= 32: 100.0% n <= 64: 100.0%

For arm and anything with an assembly strchr you would gain more by a trick
similar to what I did for x64: for the hot path it's an easy modification of
strchr where, instead of one character, you check the first two characters
where convenient (function leaky_digraph); sometimes you check only the first
one rather than trying to include a character with unsuitable alignment.
Note that unaligned loads longer by one character could be faster than trying
to shift vectors.  The logic is the following:

  strstr (s, n)
  {
    if (!n[0])
      return s;
    if (!n[1])
      return strchr (s, n[0]);  /* corner case, one-char needles are rare */
    s = leaky_digraph (s, n[0], n[1]);
    if (!s)
      return NULL;
    for (i = 1; n[i] && s[i] == n[i]; i++);
    if (!n[i])
      return s;
    return strstr_two_way (s, n);
  }

Improvements after that are mostly QoI.  For bigger haystacks the trick is to
do the code above in a loop and find constants a, b, c, d such that

  time(two_way) > a * (haystack - haystack_start) + b

and

  c * characters_compared + d * leaky_digraph_calls > time(leaky_digraph_loop)

and add a check to the loop to switch to two_way when

  a * (haystack - haystack_start) + b
    < c * characters_compared + d * leaky_digraph_calls

One could optimize the code more by iterating a mask instead of calling
leaky_digraph again, but that is just to decrease the constants in the case
where digraphs are quite frequent.
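[Editorial note: the pseudocode above compiles almost as-is once leaky_digraph is given a body. Below is a hedged C sketch: leaky_digraph here is a plain scalar loop standing in for the vectorized strchr variant described, and instead of handing off to Two-way (not shown) the skeleton simply resumes the digraph scan, which is quadratic in the worst case.]

```c
#include <stddef.h>
#include <string.h>

/* Scalar stand-in for the vectorized leaky_digraph: find the first
   position where c0 is immediately followed by c1, or NULL.  Reading
   s[1] is safe because the loop stops at the NUL terminator.  */
static const char *
leaky_digraph (const char *s, char c0, char c1)
{
  for (; *s; s++)
    if (s[0] == c0 && s[1] == c1)
      return s;
  return NULL;
}

/* Digraph-first strstr skeleton following the posted logic.  A real
   implementation would switch to Two-way instead of rescanning.  */
static const char *
strstr_digraph (const char *s, const char *n)
{
  if (!n[0])
    return s;
  if (!n[1])
    return strchr (s, n[0]);  /* one-char needles are rare */
  while ((s = leaky_digraph (s, n[0], n[1])) != NULL)
    {
      size_t i;
      for (i = 1; n[i] && s[i] == n[i]; i++);
      if (!n[i])
        return s;
      s++;  /* fallback: resume scan (Two-way would go here) */
    }
  return NULL;
}
```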
On Mon, Oct 29, 2018 at 6:54 PM Ondřej Bílka <neleai@seznam.cz> wrote:
> Real benefit is fixing idiocy ...
> That "benchmark" is complete garbage ...

When you dismiss other people's testing this aggressively you make them not
want to cooperate with you.  If I had written this patch, my reaction to this
feedback would be: fine, fix it yourself if you think my approach is idiotic.

It is perfectly possible to express concerns in a less hostile manner.  For
instance, you could have said "I think this benchmark doesn't account for the
cost of branch misprediction, because it calls strstr many times with the
same inputs.  Can you please test it on a wider variety of inputs in a
randomized order?"

zw
Hi,

Alexander Monakov wrote:
> On Mon, 29 Oct 2018, Wilco Dijkstra wrote:
>
>> The needle length is limited to 254 - combined with a hashed shift table
>> this reduces the shift table memory 16 to 32 times, lowering preprocessing
>> overhead and minimizing cache effects.
>
> This is a bit unclear: given needle length limit, unhashed table would require
> 256 bytes, hashing as implemented in the patch reduces it to 64 bytes. I think
> unhashed variant should be even faster, is saving 192 bytes on stack worth it?

The startup overhead is significant for small haystacks, so reducing it is
actually faster.  The hashing uses a simple AND instruction which is hidden
behind the memcmp call, so it doesn't cost much on a modern core.

> How does the worst case change look like (I think it's searching for "a{251}ba"
> in "aaa...aaa" if memcmp works left-to-right)? Does it make sense to move
> needle length cutoff further below 254?

To be fair you have to compare the worst case of one algorithm against the
worst case of the other, since they are different...  Two-way is slow when
searching for ba{254}b in a long sequence of ba{253}ba{253}: it is only about
3 times faster on this case despite having a theoretical factor-of-256/2
advantage.  I don't think there is any gain in slowing down common cases in
order to reduce the worst-case difference (assuming that is even feasible).

> +      do
> +	{
> +	  hs--;
>
> Doesn't this move hs out of bounds and thus invoke undefined behavior?

That's a theoretical issue, given that *(a+(i-1)), *((a-1)+i) and *((a+i)-1)
behave identically.

Wilco
Hi,

Ondřej Bílka wrote:
> Real benefit is fixing idiocy of calculating needle size before calling
> strchr, code after that tries to make as this make cold path faster.

Doing a single strchr early does help if the first character of the needle
occurs infrequently or if it finds the match.  But I don't see it improving
performance much.

> Better would be to call strchr again and if it still doesn't match for
> third time then code after that shouldn't matter as long as its linear.

Repeated strchr/memchr seems to be very bad for performance on characters
that occur frequently.

> If you capture inputs of actual applications you will find what are bottlenecks.

Yes, but we'd need applications which use strstr a lot.

> As for real inputs record something with this
> http://kam.mff.cuni.cz/~ondra/dryrun.tar.bz2

Thanks, I'll have a look.

> For arm and anything with assembly strchr you would gain more by similar
> trick as I did for x64, for hot path its easy modification of strchr where
> you instead of one character check first two characters where convenient
> (function leaky_digraph), sometimes you check only first one rather than
> trying to include character with unsuitable alignment. Note that unaligned
> loads by one character more could be faster than trying to shift vectors.

Sure, an assembly implementation could easily search for 2 characters at a
time.  But right now I'm improving the generic C version first.  It's not
even using unaligned accesses yet - and those can give significant gains
both in brute-force search and in smarter algorithms like Quicksearch.

> logic is following:
>
>   strstr (s, n)
>   {
>     if (!n[0])
>       return s;
>     if (!n[1])
>       return strchr (s, n[0]);  /* corner case, one-char needles are rare */
>     s = leaky_digraph (s, n[0], n[1]);
>     if (!s)
>       return NULL;
>     for (i = 1; n[i] && s[i] == n[i]; i++);
>     if (!n[i])
>       return s;
>     return strstr_two_way (s, n);
>   }

Yes, it would be feasible to add a digraph test like that to replace the
initial strchr check.  Do you have an idea what percentage of calls it
filters compared to plain strchr?

Wilco
On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> Alexander Monakov wrote:
> > How does the worst case change look like (I think it's searching for "a{251}ba"
> > in "aaa...aaa" if memcmp works left-to-right)? Does it make sense to move
> > needle length cutoff further below 254?

This "quicksearch" is fundamentally a bad algorithm.  If it looks fast in
some cases, it's just a matter of getting lucky, and most likely due to bad
tuning or missed optimizations in the implementation of the algorithm you're
comparing it against.  It's fundamentally a naive O(nm) search that will be
this bad in all cases except where the bad-character table happens to let
you advance a lot.

> To be fair you have to compare the worst case of one algorithm vs the worst
> case of another since they are different... Two-way is slow on searching for
> ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> faster on this case despite having a theoretical factor of 256/2 advantage.

Do you have a basis for calling this "slow"?  It should be roughly one
comparison per byte.  "Only 3x faster" doesn't sound slow to me.

> I don't think there is any gain slowing down common cases in order to
> reduce the worst-case difference (assuming that is feasible).

This argument gets made about strstr every couple years, and every time it's
wrong.  It's not that "common cases" (as a general class) are faster with
any naive algorithm.  Only very specific cases will ever be faster.  Most
common cases are roughly the same speed as with a good algorithm, and some
common cases, plus a lot more uncommon ones, are catastrophically slower
with a bad algorithm.

To my knowledge there is no good test for a case that won't become
catastrophically slower with a bad algorithm.  If there is, I would be very
surprised if the same test doesn't naturally translate into an optimized
setup for two-way.

> > +      do
> > +	{
> > +	  hs--;
> >
> > Doesn't this move hs out of bounds and thus invoke undefined behavior?
>
> That's a theoretical issue given *(a+(i-1)), *((a-1)+i) and *((a+i)-1) behave
> identically.

Calling blatant undefined behavior "a theoretical issue" is not something we
should be doing in 2018.

Rich
On Thu, Nov 01, 2018 at 12:42:35PM -0400, Rich Felker wrote:
> On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> > To be fair you have to compare the worst case of one algorithm vs the worst
> > case of another since they are different... Two-way is slow on searching for
> > ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> > faster on this case despite having a theoretical factor of 256/2 advantage.
>
> Do you have a basis for calling this "slow"? It should be roughly one
> comparison per byte. "Only 3x faster" doesn't sound slow to me.

I think there may actually be a bug (just suboptimality) in the two-way
implementation related to this case.  It's a periodic needle that uses
'memory', and glibc has (abbreviated):

  if (memory && shift < period)
    shift = needle_len - period;

From what I recall (having implemented the same), this is to ensure that it
doesn't advance too little, requiring re-comparing a segment already compared
as the right-hand factor.  However, it's possible that shift is already
larger than needle_len - period, and in that case I see no reason to use the
memory rather than advancing by the longer amount.

It's certainly not true in general that the shift of a given byte is bounded
by needle_len - period.  For example, zabc...xyz factors as [z][abc...xyz],
where 'a' has a shift of 25 and needle_len - period is 1.  Of course in this
case, after the truncated shift by 1, the memory will be gone and the next
shift won't be truncated, so it might not matter in the total run time.

Anyone else interested in looking at this to confirm if my findings make
sense?

Rich
On Thu, Nov 01, 2018 at 03:27:38PM -0400, Rich Felker wrote:
> I think there may actually be a bug (just suboptimality) in the
> two-way implementation related to this case. It's a periodic needle
> that uses 'memory', and glibc has (abbreviated):
>
>   if (memory && shift < period)
>     shift = needle_len - period;
>
> From what I recall (having implemented the same), this is to ensure
> that it doesn't advance too little, requiring re-comparing a segment
> already compared as the right-hand factor. However, it's possible that
> shift is already larger than needle_len - period, and in that case I
> see no reason to use the memory rather than advancing by the longer
> amount.
>
> It's certainly not true in general that the shift of a given byte is
> bounded by needle_len - period. For example, zabc...xyz factors as
> [z][abc...xyz] where 'a' has a shift of 25, and needle_len - period is
> 1. Of course in this case, after the truncated shift by 1, the memory
> will be gone, and the next shift won't be truncated, so it might not
> matter in the total run time.
>
> Anyone else interested in looking at this to confirm if my findings
> make sense?

I'm pretty sure this is a logic bug.  Except for bytes that don't appear in
the needle (for which shift == needle_len), shift < period is automatically
true: the needle being periodic means that each byte which appears in it
appears within the last 'period' positions of it.

I suspect what was actually intended was:

  if (memory && shift < needle_len - period)
    shift = needle_len - period;

which can be simplified to:

  if (shift < memory)
    shift = memory;

Is this okay?

Rich
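[Editorial note: the zabc...xyz example is easy to check numerically. Below is a hedged sketch using the Horspool-style "distance from last occurrence to the end" table; the period value 26 is taken from the quoted factorization, not computed here. It confirms that the shift of 'a' (25) exceeds needle_len - period (1), so clamping the shift whenever 'memory' is set can discard most of the available skip.]

```c
#include <stddef.h>

/* Bad-character shift: distance from the last occurrence of byte b in
   the needle to the needle's last position (ne_len for absent bytes).  */
static size_t
bad_char_shift (const char *ne, size_t ne_len, unsigned char b)
{
  size_t shift = ne_len;
  for (size_t i = 0; i < ne_len; i++)
    if ((unsigned char) ne[i] == b)
      shift = ne_len - 1 - i;
  return shift;
}
```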
Rich Felker wrote:
> This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> in some cases, it's just a matter of getting lucky, and most likely
> due to bad tuning or missed optimizations in the implementation of the
> algorithm you're comparing it against. It's fundamentally a naive
> O(nm) search that will be this bad in all cases except where the
> bad-character table happens to let you advance a lot.

I disagree.  What's important is real-world performance.  Practically all
good search algorithms are O(nm), and they beat any O(n) algorithm by huge
factors on actual inputs.  That should teach you that a concept like the O
notation does not guarantee good performance.

> Do you have a basis for calling this "slow"? It should be roughly one
> comparison per byte. "Only 3x faster" doesn't sound slow to me.

One comparison per byte is slow indeed.  Modern search algorithms are
sublinear and achieve O(n/m) comparisons.  You can see this when
benchmarking - performance actually increases with longer needles.

>> I don't think there is any gain slowing down common cases in order to
>> reduce the worst-case difference (assuming that is feasible).
>
> This argument gets made about strstr every couple years, and every
> time it's wrong. It's not that "common cases" (as a general class) are
> faster with any naive algorithm. Only very specific cases will ever be
> faster. Most common cases are roughly the same speed as with a good
> algorithm, and some common cases, plus a lot more uncommon ones, are
> catastrophically slower with a bad algorithm.

Actually it's the correct argument.  Common cases are indeed faster with
modern algorithms - slow algorithms like Two-way can't speed up those common
cases.  There is only a very specific, small set of inputs which can trigger
worst-case behaviour, and those are never typical inputs.  And it's easy to
fall back to a different algorithm in such a case.

> To my knowledge there is no good test for a case that won't become
> catastrophically slower with a bad algorithm. If there is, I would be
> very surprised if the same test doesn't naturally translate into an
> optimized setup for two-way.

Actually it's trivial to find the worst case for each algorithm: you just
ensure you hit the slowest path all the time.

> Calling blatant undefined behavior "a theoretical issue" is not
> something we should be doing in 2018.

It's 100% theoretical - or can you show an example that could fail?  It's
crazy to even bring this stuff up in the 21st century.  Software that does
not rely on undefined behaviour one way or another simply does not exist.
So the question becomes: is that a fault with all our software, or are the
language standards simply decades behind common practice?

Wilco
Rich Felker wrote:
> I think there may actually be a bug (just suboptimality) in the
> two-way implementation related to this case. It's a periodic needle
> that uses 'memory', and glibc has (abbreviated):
>
>   if (memory && shift < period)
>     shift = needle_len - period;
>
> I'm pretty sure this is a logic bug. Except for bytes that don't
> appear in the needle (for which shift==needle_len), shift < period is
> automatically true. The needle being periodic means that each byte
> which appears in it appears within the last 'period' positions of it.
>
> I suspect what was actually intended was:
>
>   if (memory && shift < needle_len - period)
>     shift = needle_len - period;
>
> which can be simplified to:
>
>   if (shift < memory)
>     shift = memory;

Those are all equivalent indeed; the last version gives a ~2.5% speedup.
Interestingly this path is still twice as slow as the short-needle code - it
used to be 3.5-4 times as slow before I optimized the slow AVAILABLE macro.

It's a shame the performance of strstr has been held back so long both by a
slow algorithm and a slow implementation...

Wilco
On Tue, Nov 06, 2018 at 01:02:27PM +0000, Wilco Dijkstra wrote:
> I disagree. What's important is real world performance. Practically all

Real-world performance includes performance under the control of a hostile
input source, under which naive algorithms will perform hundreds of times
slower.  This "quicksearch" is incredibly slow on a{253}ba, and it's as slow
as the most naive possible strstr whenever the last character is abundant in
the haystack.

> good search algorithms are O(nm) and they beat any O(n) algorithm by
> huge factors on actual inputs. That should teach you that a concept like
> the O notation does not guarantee good performance.

By a small factor at best, not a huge factor.

> One comparison per byte is slow indeed. Modern search algorithms
> are sublinear and achieve O(n/m) comparisons. You can see this when
> benchmarking - performance actually increases with longer needles.

Two-way of course achieves this sublinear performance for common cases, but
it's only possible for a very limited set, not in general, and does not
scale.  In particular, once the length of the needle exceeds the size of the
alphabet, it can't continue to scale this way.

> Actually it's the correct argument. Common cases are indeed faster with
> modern algorithms - slow algorithms like Two-way can't speed up those

Two-way is essentially the most modern algorithm there is.  Have you read
any of the literature on this topic?

> common cases. There is only very specific small set of inputs which can
> trigger worst-case behaviour, and those are never typical inputs. And it's
> easy to fall back to a different algorithm in such a case.

Citation needed.  If it's easy to detect them, prove it.  You'll find it's
not.  You can of course do an introsearch that counts the number of times it
backtracks and switches to two-way, but by that time you will have spent
many times the setup cost of two-way and it will be a lot slower.

> Actually it's trivial to find the worst case for each algorithm, you just ensure
> you hit the slowest path all the time.

This is not a question of finding the worst case.  It's a question of
efficiently evaluating whether a given case is one of the bad ones.

> It's 100% theoretical - or you can show an example that could fail?

With UBSan and LTO it should trap and abort.

> It's crazy to even bring this stuff up in the 21st century. Software that does
> not rely on undefined behaviour one way or another simply does not exist.
> So the question becomes, is that a fault with all our software or are the language
> standards simply decades behind common practice?

This kind of claim is not making your case.

Rich
On Tue, 6 Nov 2018, Rich Felker wrote:
> Real world performance includes performance under the control of a
> hostile input source, under which naive algorithms will perform
> hundreds of times slower. This "quicksearch" is incredibly slow on

Indeed.  I think strstr failing to be O(m+n) worst case is clearly a
security bug; when you have an upper bound on m for an O(mn) algorithm to
avoid a quadratic worst case, the only question is at what point that bound
gets too high, so that the worst-case performance is excessively bad.

I've just filed bug 23865 for wcsstr using a naive quadratic-time
implementation.
Rich Felker wrote:
> Real world performance includes performance under the control of a
> hostile input source, under which naive algorithms will perform
> hundreds of times slower. This "quicksearch" is incredibly slow on
> a{253}ba, and it's as slow as the most naive possible strstr whenever
> the last character is abundant in the haystack.

That's not true at all - in reality there is hardly any difference in
worst-case performance.  Comparing a{253}ba on Quicksearch vs ba{252}b on
Two-way using GLIBC 2.27 (the original version of Two-way) shows Two-way is
only about 16% faster.  If the worst-case performance of Two-way was deemed
acceptable, then the worst case of Quicksearch must be too.

>> good search algorithms are O(nm) and they beat any O(n) algorithm by
>> huge factors on actual inputs. That should teach you that a concept like
>> the O notation does not guarantee good performance.
>
> By a small factor at best, not a huge factor.

Getting another 2-3x speedup after doubling performance is a huge factor.
The previous implementation was doing effectively nothing useful 75-85% of
the time...

> Two-way of course achieves this sublinear performance for common
> cases, but it's only possible for a very limited set, not in general,
> and does not scale. In particular, once the length of the needle
> exceeds the size of the alphabet, it can't continue to scale this way.

Two-way is indeed not sublinear as often as Quicksearch, and that's an
inherent drawback of the algorithm.  The alphabet size is not relevant; it's
about the occurrence frequencies of characters in the input - a bad-character
table happily jumps ahead by a million if a character does not appear in a
huge needle.  It's just unlikely to happen frequently on typical inputs, so
for large inputs you don't see performance increasing further; it levels off.

> Two-way is essentially the most modern algorithm there is. Have you
> read any of the literature on this topic?

Two-way is almost 3 decades old now.  It's not modern in any sense of the
word.  I've looked at lots of papers on modern search algorithms;
practically all are about real-world performance (on English, DNA and gene
sequences) and are O(nm) - few papers consider real-time or constant-space
searching.

>> common cases. There is only very specific small set of inputs which can
>> trigger worst-case behaviour, and those are never typical inputs. And it's
>> easy to fall back to a different algorithm in such a case.
>
> Citation needed. If it's easy to detect them, prove it. You'll find
> it's not. You can of course do an introsearch that counts the number
> of times it backtracks and switches to two-way, but by that time you
> will have spent many times the setup cost of two-way and it will be a
> lot slower.

It's trivial by keeping track of how many comparisons you do vs how much
progress you make, e.g. comparisons_done > N * (cur_pos - start_pos) for a
fixed N.  But you can make it even simpler and do what I did - needles
larger than 256 are rare, so their performance is less important.

>> Actually it's trivial to find the worst case for each algorithm, you just ensure
>> you hit the slowest path all the time.
>
> This is not a question of finding the worst case. It's a question of
> efficiently evaluating if a given case is one of the bad ones.

Why on earth would you want to do that dynamically?  We can compare worst
cases offline and choose which algorithm to run for particular
needle/haystack sizes.  And that is sufficient to ensure linear time in the
worst case.

>>> Calling blatant undefined behavior "a theoretical issue" is not
>>> something we should be doing in 2018.
>>
>> It's 100% theoretical - or you can show an example that could fail?
>
> With UBSan and LTO it should trap and abort.

I mean a real-world failure on an actual supported target, not something
that's not actually possible.

Wilco
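[Editorial note: the two worst-case families being argued over can be generated mechanically for timing. A hedged sketch follows; the helper names are made up, and the shapes follow the thread's a{n}ba-vs-"aaa..." and ba{k+1}b-vs-repeated-ba{k} examples.]

```c
#include <stdlib.h>
#include <string.h>

/* Worst case for Quick-Search: needle a{a_count}ba against an all-'a'
   haystack.  The trailing 'a' pins the bad-character shift at 1 while
   memcmp scans nearly the whole needle at every position.  Caller
   frees *ne and *hs.  */
static void
make_quicksearch_worst (size_t a_count, char **ne, char **hs, size_t hs_len)
{
  *ne = malloc (a_count + 3);
  memset (*ne, 'a', a_count);
  (*ne)[a_count] = 'b';
  (*ne)[a_count + 1] = 'a';
  (*ne)[a_count + 2] = '\0';

  *hs = malloc (hs_len + 1);
  memset (*hs, 'a', hs_len);
  (*hs)[hs_len] = '\0';
}

/* Worst case discussed for Two-way: needle ba{k+1}b against a haystack
   of repeated ba{k}.  Every window shares a long prefix with the
   needle but never matches.  */
static void
make_twoway_worst (size_t k, char **ne, char **hs, size_t reps)
{
  *ne = malloc (k + 4);
  (*ne)[0] = 'b';
  memset (*ne + 1, 'a', k + 1);
  (*ne)[k + 2] = 'b';
  (*ne)[k + 3] = '\0';

  size_t unit = k + 1;
  *hs = malloc (reps * unit + 1);
  for (size_t r = 0; r < reps; r++)
    {
      (*hs)[r * unit] = 'b';
      memset (*hs + r * unit + 1, 'a', k);
    }
  (*hs)[reps * unit] = '\0';
}
```

Timing strstr on both pairs (for instance k = 252 and a_count = 253 with a multi-megabyte haystack) is one way to reproduce the comparison argued about above.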
On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote: > Rich Felker wrote: > > > Real world performance includes performance under the control of a > > hostile input source, under which naive algorithms will perform > > hundreds of times slower. This "quicksearch" is incredibly slow on > > a{253}ba, and it's as slow as the most naive possible strstr whenever > > the last character is abundant in the haystack. > > That's not true at all - in reality there is hardly any difference in worst case > performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on > Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way > is only about 16% faster. If the worst-case performance of Two-way was > deemed acceptable, then the worst-case of Quicksearch must be too. I don't know what figures you're talking about, but the worst-case for "quicksearch" with a bound of 256 bytes is easily over 100x slower than the worst case of two-way. > >> good search algorithms are O(nm) and they beat any O(n) algorithm by > >> huge factors on actual inputs. That should teach you that a concept like > >> the O notation does not guarantee good performance. > > > > By a small factor at best, not a huge factor. > > Getting another 2-3x speedup after doubling performance is a huge factor. > The previous implementation was doing effectively nothing useful 75-85% > of the time... Then fix implementation bugs. Don't replace good algorithms with awful ones. For any given needle, it's definitely possible to make it such that,... > >> One comparison per byte is slow indeed. Modern search algorithms > >> are sublinear and achieve O(n/m) comparisons. You can see this when > >> benchmarking - performance actually increases with longer needles. > > > > Two-way of course achieves this sublinear performance for common > > cases, but it's only possible for a very limited set, not in general, > > and does not scale. 
In particular, once the length of the needle > > exceeds the size of the alphabet, it can't continue to scale this way. > > Two-way is not sublinear as often as Quicksearch indeed, and that's an > inherent drawback of the algorithm. It's sublinear in a superset of the cases where "quicksearch" is sublinear: whenever you get to use the bad character table (which works the same for both algorithms), *plus* when a mismatch in the left factor lets you advance by a whole period. This latter class of cases mostly helps when the period is large and the right factor is short, but that's an important class of real-world cases (most needles are non-periodic, i.e. period = strlen(needle), and the right factor is very short). In some cases, the maximal-suffix algorithm selects a gratuitously long right factor; it may be possible to optimize this selection further. > The alphabet size is not relevant, it's I wasn't precise in stating what I meant here but it's not important. > about the occurrence frequencies of characters in the input - a bad character > table happily jumps ahead by a million if a character does not appear in a > huge needle. It's just unlikely to happen frequently on typical inputs. So for > large inputs you don't see performance increasing further but level off. Worst-case is never about what's likely to happen in the haystack but what can happen. > > Two-way is essentially the most modern algorithm there is. Have you > > read any of the literature on this topic? > > Two-way is almost 3 decades old now. It's not modern in any sense of the > word. I've looked at lots of papers of modern search algorithms, practically > all are about real-world performance (on English, DNA and gene sequences) > and are O(nm) - few papers consider real-time or constant-space searching. For DNA, quadratic-time is even worse. > >> common cases. There is only a very specific small set of inputs which can > >> trigger worst-case behaviour, and those are never typical inputs. 
And it's > >> easy to fall back to a different algorithm in such a case. > > > > Citation needed. If it's easy to detect them, prove it. You'll find > > it's not. You can of course do an introsearch that counts the number > > of times it backtracks and switches to two-way, but by that time you > > will have spent many times the setup cost of two-way and it will be a > > lot slower. > > It's trivial by keeping track of how many comparisons you do vs how much > progress you make, e.g. comparisons_done > N * (cur_pos - start_pos) for > a fixed N. But you can make it even simpler and do what I did - needles > larger than 256 are rare and so their performance is less important. Quicksearch performs *utterly abysmally* on plenty of needles shorter than 256 or even shorter than 50. You cannot use it based on the length of the needle except possibly for very short needles (maybe <16), and even then you will do worse on a lot of real-world cases. > >> Actually it's trivial to find the worst case for each algorithm, you just ensure > >> you hit the slowest path all the time. > > > > This is not a question of finding the worst case. It's a question of > > efficiently evaluating if a given case is one of the bad ones. > > Why on earth would you want to do that dynamically? We can compare worst > cases offline and choose what algorithm to run for particular needle/haystack > sizes. And that is sufficient to ensure linear time in the worst case. You can do this if you're implementing a custom search for your application. You cannot do this for the libc strstr because it does not know what the application is. It needs to be able to perform non-pathologically for any inputs. > >> > Calling blatant undefined behavior "a theoretical issue" is not > >> > something we should be doing in 2018. > >> > >> It's 100% theoretical - or you can show an example that could fail? > > > > With UBSan and LTO it should trap and abort. 
> > I mean a real-world failure on an actual supported target, not something > that's not actually possible. This is a real-world near-future target, and introducing gratuitous breakage for it now is utterly backwards, especially when it's so easy to avoid (skipping the -- ahead of time and just adding a -1 to each []). Rich
On Tue, 6 Nov 2018, Rich Felker wrote: > > > >> > Calling blatant undefined behavior "a theoretical issue" is not > > >> > something we should be doing in 2018. > > >> > > >> It's 100% theoretical - or you can show an example that could fail? > > > > > > With UBSan and LTO it should trap and abort. > > > > I mean a real-world failure on an actual supported target, not something > > that's not actually possible. > > This is a real-world near-future target, and introducing gratuitous > breakage for it now is utterly backwards, especially when it's so easy > to avoid (skipping the -- ahead of time and just adding a -1 to each > []). A simpler fix is available: 'hs++, hs_len--' after the initial optimistic memcmp has failed. Alexander
Rich Felker wrote: > On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote: >> That's not true at all - in reality there is hardly any difference in worst case >> performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on >> Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way >> is only about 16% faster. If the worst-case performance of Two-way was >> deemed acceptable, then the worst-case of Quicksearch must be too. > > I don't know what figures you're talking about, but the worst-case for > "quicksearch" with a bound of 256 bytes is easily over 100x slower > than the worst case of two-way. Not a chance. You do realize I am measuring this stuff for real rather than just making up random numbers? Worst-case performance of both algorithms is very similar for needles up to 256. It's not possible to get 10x difference, let alone 100x. >> Getting another 2-3x speedup after doubling performance is a huge factor. >> The previous implementation was doing effectively nothing useful 75-85% >> of the time... > > Then fix implementation bugs. Don't replace good algorithms with awful > ones. For any given needle, it's definitely possible to make it such > that,... I already did that, remember? If you believe you can do better then post further performance improvements. >> > Two-way is essentially the most modern algorithm there is. Have you >> > read any of the literature on this topic? >> >> Two-way is almost 3 decades old now. It's not modern in any sense of the >> word. I've looked at lots of papers of modern search algorithms, practically >> all are about real-world performance (on English, DNA and gene sequences) >> and are O(nm) - few papers consider real-time or constant-space searching. > > For DNA, quadratic-time is even worse. The goal is to be fast on real data, not some made up worst case. Actual DNA inputs don't hit bad quadratic cases. So researchers don't bother with the slower linear algorithms. 
> Quicksearch performs *utterly abysmally* on plenty of needles shorter > than 256 or even shorter than 50. You cannot use it based on the > length of the needle except possibly for very short needles (maybe > <16), and even then you will do worse on a lot of real-world cases. Citation needed. You clearly haven't ever tried Quicksearch. It beats Two-way in every single case on real inputs. The length of the needle matters only for the worst case, and by limiting the needle size, you avoid quadratic behaviour. > You can do this if you're implementing a custom search for your > application. You cannot do this for the libc strstr because it does > not know what the application is. It needs to be able to perform > non-pathologically for any inputs. Limiting the needle size to avoid quadratic behaviour is perfectly fine for GLIBC since the goal is to avoid denial of service via specially crafted inputs. It's impossible for strstr to always have identical performance on all possible inputs - practically every algorithm has fast and slow cases. Rabin-Karp is perhaps an exception given a good hash function. It's also 10x faster than Two-way's slow cases, so it's a candidate to replace Two-way. >> >> > Calling blatant undefined behavior "a theoretical issue" is not >> >> > something we should be doing in 2018. >> >> >> >> It's 100% theoretical - or you can show an example that could fail? >> > >> > With UBSan and LTO it should trap and abort. >> >> I mean a real-world failure on an actual supported target, not something >> that's not actually possible. > > This is a real-world near-future target, and introducing gratuitous > breakage for it now is utterly backwards, especially when it's so easy > to avoid (skipping the -- ahead of time and just adding a -1 to each > []). It seems unlikely it would ever work across dynamic library boundaries, but even if you replace the haystack input in strstr with a hardcoded array, UBSan does not report a problem. 
The only way I could make it report anything at all was to use hs = &s[-1]. Wilco
On Wed, Nov 07, 2018 at 04:01:51PM +0000, Wilco Dijkstra wrote: > Rich Felker wrote: > > On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote: > > >> That's not true at all - in reality there is hardly any difference in worst case > >> performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on > >> Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way > >> is only about 16% faster. If the worst-case performance of Two-way was > >> deemed acceptable, then the worst-case of Quicksearch must be too. > > > > I don't know what figures you're talking about, but the worst-case for > > "quicksearch" with a bound of 256 bytes is easily over 100x slower > > than the worst case of two-way. > > Not a chance. You do realize I am measuring this stuff for real rather than > just making up random numbers? Worst-case performance of both algorithms > is very similar for needles up to 256. It's not possible to get 10x difference, let > alone 100x. I'm not either, but we are measuring different implementations, particularly of the underlying memcmp. But a better memcmp is not going to make it 100x faster. Maybe 10x at best, leaving a 10x difference still. And not all archs have a ridiculously fast memcmp. Indeed, just re-testing on a glibc box, "quicksearch" takes 33 whole microseconds to search (and not find) a{90}ba in (ba{89}){9}. Two-way can do the same in 3 microseconds on this box, although glibc's (2.24 from Debian) included two-way on this box is much slower and takes 12. So that's a 10x difference at 92 bytes (limitation of the measurement code from a third party I was using). At 256 bytes I'd expect more like a 25x difference. > >> Getting another 2-3x speedup after doubling performance is a huge factor.. > >> The previous implementation was doing effectively nothing useful 75-85% > >> of the time... > > > > Then fix implementation bugs. Don't replace good algorithms with awful > > ones. 
For any given needle, it's definitely possible to make it such > > that,... > > I already did that, remember? If you believe you can do better then post further > performance improvements. Nobody has replied to my branch of this thread about limiting shift in the case of memory. I'm pretty confident it's right now so maybe I should just submit a patch. But it's frustrating that there is more interest in discussing bad algorithms than in improving the implementation of a good one. > > Quicksearch performs *utterly abysmally* on plenty of needles shorter > > than 256 or even shorter than 50. You cannot use it based on the > > length of the needle except possibly for very short needles (maybe > > <16), and even then you will do worse on a lot of real-world cases. > > Citation needed. You clearly haven't ever tried Quicksearch. It beats Two-way > in every single case on real inputs. The length of the needle matters only for the > worst case, and by limiting the needle size, you avoid quadratic behaviour. See above. I have tried it and it performs abysmally. > > You can do this if you're implementing a custom search for your > > application. You cannot do this for the libc strstr because it does > > not know what the application is. It needs to be able to perform > > non-pathologically for any inputs. > > Limiting the needle size to avoid quadratic behaviour is perfectly fine for GLIBC > since the goal is to avoid denial of service via specially crafted inputs. > > It's impossible for strstr to always have identical performance on all possible > inputs - practically every algorithm has fast and slow cases. Rabin-Karp is > perhaps an exception given a good hash function. It's also 10x faster than > Two-way's slow cases, so it's a candidate to replace Two-way. It's also quadratic, so it's not a candidate, unless you cap the length for which it's used very low. 
> >> >> > Calling blatant undefined behavior "a theoretical issue" is not > >> >> > something we should be doing in 2018. > >> >> > >> >> It's 100% theoretical - or you can show an example that could fail? > >> > > >> > With UBSan and LTO it should trap and abort. > >> > >> I mean a real-world failure on an actual supported target, not something > >> that's not actually possible. > > > > This is a real-world near-future target, and introducing gratuitous > > breakage for it now is utterly backwards, especially when it's so easy > > to avoid (skipping the -- ahead of time and just adding a -1 to each > > []). > > It seems unlikely it would ever work across dynamic library boundaries, but Static linking is valid and supported usage. > even if you replace the haystack input in strstr with a hardcoded array, UBSan > does not report a problem. The only way I could make it report anything at all > was to use hs = &s[-1]. If it fails to detect it, that's a bug in UBSan that will hopefully be fixed at some point. I've actually had serious memory corruption bugs that would have been caught by correctly trapping invalid pointer arithmetic outside the bounds of the object, but weren't, so I'm aware that it's not perfect and have a significant interest in its being improved in the future. Rich
On Wed, Nov 07, 2018 at 02:28:45PM -0500, Rich Felker wrote: > > >> Getting another 2-3x speedup after doubling performance is a huge factor.. > > >> The previous implementation was doing effectively nothing useful 75-85% > > >> of the time... > > > > > > Then fix implementation bugs. Don't replace good algorithms with awful > > > ones. For any given needle, it's definitely possible to make it such > > > that,... > > > > I already did that, remember? If you believe you can do better then post further > > performance improvements. > > Nobody has replied to my branch of this thread about limiting shift in > the case of memory. I'm pretty confident it's right now so maybe I > should just submit a patch. But it's frustrating that there is more > interest in discussing bad algorithms than in improving the > implementation of a good one. On further reading, it looks like the whole "performance problem" is probably lack of a machine-optimized comparison loop for the two (forward and backwards) comparisons in two-way. If we had "rep cmp" functions for each direction (optimized to use specialized insns or vector ops, possibly misaligned if allowed by the ISA, taking into account that we know the needle length and a lower bound on the haystack length so that there's no overread issue) returning the first mismatching offset, I see no reason why there should be any cases that perform worse than "quicksearch", and pretty much all cases should speed up a lot vs the status quo. Ideally the "rep cmp" functions should be inline containing inline asm fragments so they can be inlined right into the loop without any call overhead, but this probably only matters for short needles. Rich
Rich Felker wrote: > On Wed, Nov 07, 2018 at 04:01:51PM +0000, Wilco Dijkstra wrote: >> Not a chance. You do realize I am measuring this stuff for real rather than >> just making up random numbers? Worst-case performance of both algorithms >> is very similar for needles up to 256. It's not possible to get 10x difference, let >> alone 100x. > > I'm not either, but we are measuring different implementations, > particularly of the underlying memcmp. But a better memcmp is not > going to make it 100x faster. Maybe 10x at best, leaving a 10x > difference still. And not all archs have a ridiculously fast memcmp. A slow memcmp is obviously not an issue with the algorithm or implementation. And it's something that can be easily fixed. It might even be feasible to mitigate for a slow memcmp, for example splitting the memcmp into several smaller ones. > Indeed, just re-testing on a glibc box, "quicksearch" takes 33 whole > microseconds to search (and not find) a{90}ba in (ba{89}){9}. Two-way > can do the same in 3 microseconds on this box, although glibc's (2.24 > from Debian) included two-way on this box is much slower and takes 12. > So that's a 10x difference at 92 bytes (limitation of the measurement > code from a third party I was using). At 256 bytes I'd expect more > like a 25x difference. Well, using that input with a slow byte-by-byte memcmp, Quicksearch is less than 4 times as slow as Two-way on its worst case. The same input of size 254 is 9.3 times slower. But even this worst-worst case is not particularly slow and easily fixable by not using a stupid memcmp. Or you could limit the needle size to something smaller for such a target. > Nobody has replied to my branch of this thread about limiting shift in > the case of memory. I'm pretty confident it's right now so maybe I > should just submit a patch. But it's frustrating that there is more > interest in discussing bad algorithms than in improving the > implementation of a good one. 
I replied and benchmarked it - it's correct but doesn't make a significant difference. While some further tweaks are possible, I don't believe that Two-way could ever be competitive with any of the modern algorithms - it needs to be at least 2-3 times faster, but you just can't speed up the matching process due to the way it needs to be done in order to stay linear. However, if you feel strongly that you can write a faster version (or even a different algorithm), then just go ahead. This implementation (and similar ones in gnulib, newlib, musl, bsd etc) is quite ancient, and the many hacks and badly integrated bad-char table haven't done it any good. >> Citation needed. You clearly haven't ever tried Quicksearch. It beats Two-way >> in every single case on real inputs. The length of the needle matters only for the >> worst case, and by limiting the needle size, you avoid quadratic behaviour. > > See above. I have tried it and it performs abysmally. No. All you showed is that performance is bad if you use a slow byte-by-byte memcmp on a slow microarchitecture. That certainly doesn't prove anything about Quicksearch. >> Limiting the needle size to avoid quadratic behaviour is perfectly fine for GLIBC >> since the goal is to avoid denial of service via specially crafted inputs. >> >> It's impossible for strstr to always have identical performance on all possible >> inputs - practically every algorithm has fast and slow cases. Rabin-Karp is >> perhaps an exception given a good hash function. It's also 10x faster than >> Two-way slow cases, so it's a candidate to replace Two-way. > > It's also quadratic, so it's not a candidate, unless you cap the > length for which it's used very low. Only if you manage to break the hash function. RK is actually more interesting for really large needles given performance is practically independent of the size of the needle. It not only handles the worst cases of other algorithms much faster but has near identical performance for any input. 
Wilco
Hi, Rich Felker wrote: > On further reading, it looks like the whole "performance problem" is > probably lack of a machine-optimized comparison loop for the two > (forward and backwards) comparisons in two-way. If we had "rep cmp" > functions for each direction (optimized to use specialized insns or > vector ops, possibly misaligned if allowed by the ISA, taking into > account that we know the needle length and a lower bound on the > haystack length so that there's no overread issue) returning the first > mismatching offset, I see no reason why there should be any cases that > perform worse than "quicksearch", and pretty much all cases should > speed up a lot vs the status quo. > > Ideally the "rep cmp" functions should be inline containing inline asm > fragments so they can be inlined right into the loop without any call > overhead, but this probably only matters for short needles. This won't help at all. The character matching loops don't show up in profiles. Almost all time is spent in finding the first character, i.e. strchr (short needles) and the Horspool loop (long needles). Using Quicksearch to find a prefix of the suffix string should be faster. Wilco
On Fri, Nov 09, 2018 at 12:46:54PM +0000, Wilco Dijkstra wrote: > Hi, > > Rich Felker wrote: > > On further reading, it looks like the whole "performance problem" is > > probably lack of a machine-optimized comparison loop for the two > > (forward and backwards) comparisons in two-way. If we had "rep cmp" > > functions for each direction (optimized to use specialized insns or > > vector ops, possibly misaligned if allowed by the ISA, taking into > > account that we know the needle length and a lower bound on the > > haystack length so that there's no overread issue) returning the first > > mismatching offset, I see no reason why there should be any cases that > > perform worse than "quicksearch", and pretty much all cases should > > speed up a lot vs the status quo. > > > > Ideally the "rep cmp" functions should be inline containing inline asm > > fragments so they can be inlined right into the loop without any call > > overhead, but this probably only matters for short needles. > > This won't help at all. The character matching loops don't show up in profiles. > Almost all time is spent in finding the first character, ie. strchr (short needles) > and the Horsepool loop (long needles). Using Quicksearch to find a prefix of > the suffix string should be faster. Which profiles? The common cases (where significant-length prefix matches are usually rare) or the worst cases? For needles that factor poorly (long right factor), the latter breaks down to essentially repeated disjoint memcmp's of the right factor against the whole haystack, in successive order, starting over at the beginning of the needle's right factor whenever there's a mismatch. That sounds highly "memcmp-bound" to me. If worst cases are something you care about improving vs the status quo, this is worth pursuing, but if not, okay. I agree that for common cases there are potentially much better improvements to be made elsewhere. 
strchr is of course a big help when the character to be searched is rare but useless (and pessimizes from call/setup overhead unless you can inline it) if it's frequent. It's probably better to search for the first up-to-N characters of the right factor, where N is a word/vector size, rotating bytes of the haystack in/out of a running word/vector for comparison (musl's strstr does this for needle sizes up to 4 already, and some data has shown it's usually better than two-way a bit further on 64-bit systems). Performing the critical factorization via the maximal suffix process described in the two-way paper is also suboptimal, and dependent on the actual order of the alphabet. For example, hello factors as [hell][o] but hellk factors as [he][llk] despite them being structurally isomorphic under a renaming of the alphabet. In some cases, optimal factorizations can be determined by short-circuiting past the MS process. For example, if the loop setting up the bad char table determines that the final character of the needle is the only occurrence of that character, [l-1][1] is a critical factorization and p=l. There are larger classes of this type of shortcut but I'm not sure what's efficient/reasonable-tradeoff to do. Searching for a short right factor is appealing though because it means you get to skip by the whole period as soon as there's any mismatch on the left factor. It might be possible to choose (at least heuristically) some permutation of the alphabet (probably anything more than a rotation of it is not practical, though) that optimizes/improves the MS factorization results though. For example, in the "hellk" example, rotating the alphabet so that 'k' is the first or last character would make it factor as [hell][k] rather than [he][llk]. Rich
Hi, Rich Felker wrote: > Which profiles? The common cases (where significant-length prefix > matches are usually rare) or the worst cases? Both - the profiles vary between the different code paths, but I don't see long matches for good or bad cases. I made the slow cases significantly slower by using a haystack of random 'a' and 'b' chars. This causes lots of branch mispredictions by alternating between the slow paths that search for the first character. > For needles that factor > poorly (long right factor), the latter breaks down to essentially > repeated disjoint memcmp's of the right factor against the whole > haystack, in successive order, starting over at the beginning of the > needle's right factor whenever there's a mismatch. That sounds highly > "memcmp-bound" to me. If worst cases are something you care about > improving vs the status quo, this is worth pursuing, but if not, okay. Even though it does only 1 character per iteration, finding the first matching character is more complex and slower, so I haven't seen a case where it is the bottleneck. > I agree that for common cases there are potentially much better > improvements to be made elsewhere. strchr is of course a big help when > the character to be searched is rare but useless (and pessimizes from > call/setup overhead unless you can inline it) if it's frequent. It's > probably better to search for the first up-to-N characters of the > right factor, where N is a word/vector size, rotating bytes of the > haystack in/out of a running word/vector for comparison (musl's strstr > does this for needle sizes up to 4 already, and some data has shown > it's usually better than two-way a bit further on 64-bit systems). Yes, searching for a fixed-size substring would work better indeed since you get far fewer false matches than when searching for a single char. I don't think brute force word-based matching is fast enough. I tried a variant using unaligned accesses which are much faster. 
However Quicksearch still beats it. The key is you can use any fast algorithm given the substring would be fixed-size, so linear time is guaranteed. > Performing the critical factorization via the maximal suffix process > described in the two-way paper is also suboptimal, and dependent on > the actual order of the alphabet. For example hello factors as > [hell][o] but hellk factors as [he][llk] despite them being > structurally isomorphic under a renaming of the alphabet. In some > cases, optimal factorizations can be determined by short-circuiting > past the MS process. For example if the loop setting up the bad char > table determines that the final character of the needle is the only > occurrance of that character, [l-1][1] is a critical factorization and > p=l. There are larger classes of this type of shortcut but I'm not > sure what's efficient/reasonable-tradeoff to do. Indeed, the critical factorization doesn't appear optimal but I haven't seen any research that tries to improve it. There must be better ways of doing it since all you want is a short fixed-size substring with the right properties. However finding an algorithm that does this efficiently might not be trivial. > Searching for a short > right factor is appealing though because it means you get to skip by > the whole period as soon as there's any mismatch on the left factor. Indeed. And apart from highly repetitive needles it would allow for a simple memcmp for matching the needle. Wilco
diff --git a/string/str-two-way.h b/string/str-two-way.h index 523d946c59412e1f1f65b8ec3778553b83191952..31c3f18fb057cdd999c3ac9e9d894a8b62a98a70 100644 --- a/string/str-two-way.h +++ b/string/str-two-way.h @@ -221,7 +221,7 @@ critical_factorization (const unsigned char *needle, size_t needle_len, most 2 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching. If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching. */ -static RETURN_TYPE +static inline RETURN_TYPE two_way_short_needle (const unsigned char *haystack, size_t haystack_len, const unsigned char *needle, size_t needle_len) { diff --git a/string/strstr.c b/string/strstr.c index f74d7189ed1319f6225525cc2d32380745de1523..206f20ddfcf10f61ba68725b516ef00356e9028c 100644 --- a/string/strstr.c +++ b/string/strstr.c @@ -22,11 +22,8 @@ # include <config.h> #endif -/* Specification of strstr. */ #include <string.h> -#include <stdbool.h> - #ifndef _LIBC # define __builtin_expect(expr, val) (expr) #endif @@ -47,46 +44,115 @@ #define STRSTR strstr #endif -/* Return the first occurrence of NEEDLE in HAYSTACK. Return HAYSTACK - if NEEDLE is empty, otherwise NULL if NEEDLE is not found in - HAYSTACK. */ -char * -STRSTR (const char *haystack, const char *needle) +static inline char * +strstr2 (const unsigned char *hs, const unsigned char *ne) { - size_t needle_len; /* Length of NEEDLE. */ - size_t haystack_len; /* Known minimum length of HAYSTACK. */ + uint32_t h1 = (ne[0] << 16) | ne[1]; + uint32_t h2 = 0; + for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs) + h2 = (h2 << 16) | c; + return h1 == h2 ? (char *)hs - 2 : NULL; +} - /* Handle empty NEEDLE special case. 
*/ - if (needle[0] == '\0') - return (char *) haystack; +static inline char * +strstr3 (const unsigned char *hs, const unsigned char *ne) +{ + uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8); + uint32_t h2 = 0; + for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs) + h2 = (h2 | c) << 8; + return h1 == h2 ? (char *)hs - 3 : NULL; +} - /* Skip until we find the first matching char from NEEDLE. */ - haystack = strchr (haystack, needle[0]); - if (haystack == NULL || needle[1] == '\0') - return (char *) haystack; +static inline char * +strstr4 (const unsigned char *hs, const unsigned char *ne) +{ + uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8) | ne[3]; + uint32_t h2 = 0; + for (int c = hs[0]; c != 0 && h1 != h2; c = *++hs) + h2 = (h2 << 8) | c; + return h1 == h2 ? (char *)hs - 4 : NULL; +} - /* Ensure HAYSTACK length is at least as long as NEEDLE length. - Since a match may occur early on in a huge HAYSTACK, use strnlen +/* Number of bits used to index shift table. */ +#define SHIFT_TABLE_BITS 6 + +/* Extremely fast strstr algorithm with guaranteed linear-time performance. + Small needles up to size 4 use a dedicated linear search. Longer needles + up to size 254 use Sunday's Quick-Search algorithm. Due to its simplicity + it has the best average performance of string matching algorithms on almost + all inputs. It uses a bad-character shift table to skip past mismatches. + By limiting the needle length to 254, the shift table can be reduced to 8 + bits per entry, lowering preprocessing overhead and minimizing cache effects. + The limit also implies the worst-case performance is linear. + Even larger needles are processed by the linear-time Two-Way algorithm. +*/ +char * +STRSTR (const char *haystack, const char *needle) +{ + const unsigned char *hs = (const unsigned char *) haystack; + const unsigned char *ne = (const unsigned char *) needle; + + /* Handle short needle special cases first. 
*/ + if (ne[0] == '\0') + return (char *)hs; + hs = (const unsigned char *)strchr ((const char*)hs, ne[0]); + if (hs == NULL || ne[1] == '\0') + return (char*)hs; + if (ne[2] == '\0') + return strstr2 (hs, ne); + if (ne[3] == '\0') + return strstr3 (hs, ne); + if (ne[4] == '\0') + return strstr4 (hs, ne); + + /* Ensure haystack length is at least as long as needle length. + Since a match may occur early on in a huge haystack, use strnlen and read ahead a few cachelines for improved performance. */ - needle_len = strlen (needle); - haystack_len = __strnlen (haystack, needle_len + 256); - if (haystack_len < needle_len) + size_t ne_len = strlen ((const char*)ne); + size_t hs_len = strnlen ((const char*)hs, ne_len | 512); + if (hs_len < ne_len) return NULL; - /* Check whether we have a match. This improves performance since we avoid - the initialization overhead of the two-way algorithm. */ - if (memcmp (haystack, needle, needle_len) == 0) - return (char *) haystack; - - /* Perform the search. Abstract memory is considered to be an array - of 'unsigned char' values, not an array of 'char' values. See - ISO C 99 section 6.2.6.1. */ - if (needle_len < LONG_NEEDLE_THRESHOLD) - return two_way_short_needle ((const unsigned char *) haystack, - haystack_len, - (const unsigned char *) needle, needle_len); - return two_way_long_needle ((const unsigned char *) haystack, haystack_len, - (const unsigned char *) needle, needle_len); + /* Check whether we have a match. This improves performance since we + avoid initialization overheads. */ + if (memcmp (hs, ne, ne_len) == 0) + return (char *) hs; + + /* Use the Quick-Search algorithm for needle lengths less than 255. */ + if (__glibc_likely (ne_len < 255)) + { + uint8_t shift[1 << SHIFT_TABLE_BITS]; + const unsigned char *end = hs + hs_len - ne_len; + + /* Initialize bad character shift hash table. 
*/ + memset (shift, ne_len + 1, sizeof (shift)); + for (int i = 0; i < ne_len; i++) + shift[ne[i] % sizeof (shift)] = ne_len - i; + + do + { + hs--; + + /* Search by skipping past bad characters. */ + size_t tmp = shift[hs[ne_len] % sizeof (shift)]; + for (hs += tmp; hs <= end; hs += tmp) + { + tmp = shift[hs[ne_len] % sizeof (shift)]; + if (memcmp (hs, ne, ne_len) == 0) + return (char*)hs; + } + if (end[ne_len] == '\0') + return NULL; + end += strnlen ((const char*)end + ne_len, 2048); + } + while (hs <= end); + + return NULL; + } + + /* Use Two-Way algorithm for very long needles. */ + return two_way_long_needle (hs, hs_len, ne, ne_len); } libc_hidden_builtin_def (strstr)