Message ID | 20150513185107.GA4100@domone |
---|---|
State | New |
Headers | show |
Ondřej Bílka wrote: > You wont get illegal acces. I don't see why not. If memchr returns haystack_end - 1, haystack[1] will be equivalent to *haystack_end, i.e., one past the end of the haystack. > Here is equivalent without goto. It uses implementation behaviour that > memcmp(x,y,0) doesn't touch anything. Better, but I think it still has the problem. Also, come to think of it, it's not clear that we should bother with the 'needle_len == 2' test (are needles of length 2 really that common?), and things are simpler and smaller without it, so how about the attached instead?
On Wed, May 13, 2015 at 06:06:58PM -0700, Paul Eggert wrote: > Ondřej Bílka wrote: > > >You wont get illegal acces. > > I don't see why not. If memchr returns haystack_end - 1, > haystack[1] will be equivalent to *haystack_end, i.e., one past the > end of the haystack. > I am using different end here + const unsigned char *haystack_end = (const unsigned char *) + haystack_start + haystack_len + - needle_len + 1; > >Here is equivalent without goto. It uses implementation behaviour that > >memcmp(x,y,0) doesn't touch anything. > > Better, but I think it still has the problem. Also, come to think > of it, it's not clear that we should bother with the 'needle_len == > 2' test (are needles of length 2 really that common?), and things > are simpler and smaller without it, so how about the attached > instead? > Reason is performance. Idea is to use this algorithm and switch to slow two-way algorithm only for pathological inputs. This uses simple heuristic to look only for first k characters and once it matches character beyond k'th switch to two-way algorithm. With single memchr as now its heuristic with k=1. What are you suggesting is k=2. My algorithm uses k=3. Main motivations is that pairs are still too common and in longer switch you will quickly switch to slow two-way algorithm. I could use more sophisticated accounting of runtime cost but it would likely cost more than benefit as most inputs are quite small. Finally there is buy-or-rent problem. Computing critical factorization is very costy, often several times more expensive than using naive algorithm. A solution to limit loss is buy-or-rent approach. You run a naive algorithm even if its slower until you reach time diffierence equal of doing factorization, then you switch to two-way algorithm and you are at most twice slower than if you guessed correctly what to use.
Ondřej Bílka wrote: > I am using different end here > > + const unsigned char *haystack_end = (const unsigned char *) > + haystack_start + haystack_len > + - needle_len + 1; Ah, sorry, didn't see that. But in that case the name 'haystack_end' is misleading -- that's not the haystack's end, but is something else. So a renaming would appear to be in order. > Main motivations is that pairs are still too common Too common where? Do we have traces of actual programs?
On Thu, May 14, 2015 at 11:08:42PM -0700, Paul Eggert wrote: > Ondřej Bílka wrote: > >I am using different end here > > > >+ const unsigned char *haystack_end = (const unsigned char *) > >+ haystack_start + haystack_len > >+ - needle_len + 1; > > Ah, sorry, didn't see that. But in that case the name > 'haystack_end' is misleading -- that's not the haystack's end, but > is something else. So a renaming would appear to be in order. > Do you have better suggestion? > >Main motivations is that pairs are still too common > > Too common where? Do we have traces of actual programs? I actually have applications that I use have most haystacks less than 64 bytes so it doesn't make difference. However its better to be prepared in case programmer uses kb length haystacks where it would happen. An english digraph th frequency is around 1% so you will likely switch in first 1/10 of input. For triplets there could be same problem but I decided to keep it simple, alternatively could add quadruple check I am open what to use.
diff --git a/string/memmem.c b/string/memmem.c index 8a81f65..088a2f1 100644 --- a/string/memmem.c +++ b/string/memmem.c @@ -46,6 +46,10 @@ __memmem (const void *haystack_start, size_t haystack_len, not an array of 'char' values. See ISO C 99 section 6.2.6.1. */ const unsigned char *haystack = (const unsigned char *) haystack_start; const unsigned char *needle = (const unsigned char *) needle_start; + const unsigned char *haystack_end = (const unsigned char *) + haystack_start + haystack_len + - needle_len + 1; + if (needle_len == 0) /* The first occurrence of the empty string is deemed to occur at @@ -57,6 +61,25 @@ __memmem (const void *haystack_start, size_t haystack_len, if (__glibc_unlikely (haystack_len < needle_len)) return NULL; + if (needle_len == 1) + return memchr (haystack, needle[0], haystack_end - haystack); + + for (; ;haystack++) + { + haystack = memchr (haystack, needle[0], haystack_end - haystack); + + if (!haystack) + return NULL; + + if (haystack[1] == needle[1] + && (needle_len == 2 || haystack[2] == needle[2])) + { + if (!memcmp (haystack + 2, needle + 2, needle_len - 2)) + return (void *) haystack; + break; + } + } + /* Use optimizations in memchr when possible, to reduce the search size of haystack using a linear algorithm with a smaller coefficient. However, avoid memchr for long needles, since we