Message ID | 001301cfcd0a$f0b62670$d2227350$@com |
---|---|
State | New |
Headers | show |
On 09/10/2014 05:21 PM, Wilco Dijkstra wrote: > Yes, you're right, I timed it and there is actually little difference, while > the code is now even simpler. New version below (not attaching results in bad > characters due to various mail servers changing line endings). > > OK for commit? I think you could simplify it down to strnlen, memcpy, and memset.
On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote: > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote: > >Yes, you're right, I timed it and there is actually little difference, while > >the code is now even simpler. New version below (not attaching results in bad > >characters due to various mail servers changing line endings). > > > >OK for commit? > > I think you could simplify it down to strnlen, memcpy, and memset. I don't think that's an improvement, at least not in the general case. It involves iterating twice over the source string, which for long strings could mean blowing the whole cache twice and fetching from main memory twice. There's a good reason that string operations are usually implemented to perform the copy and length computation together in a single pass. Rich
> Florian Weimer wrote: > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote: > > Yes, you're right, I timed it and there is actually little difference, while > > the code is now even simpler. New version below (not attaching results in bad > > characters due to various mail servers changing line endings). > > > > OK for commit? > > I think you could simplify it down to strnlen, memcpy, and memset. I'll have a go at that in a follow-on patch. Something similar is possible with strcpy and strncat which currently do byte-by-byte copies as well. Simplifying the code is certainly a nice benefit besides the performance gain on longer strings. Wilco
> Rich Felker wrote: > On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote: > > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote: > > >Yes, you're right, I timed it and there is actually little difference, while > > >the code is now even simpler. New version below (not attaching results in bad > > >characters due to various mail servers changing line endings). > > > > > >OK for commit? > > > > I think you could simplify it down to strnlen, memcpy, and memset. > > I don't think that's an improvement, at least not in the general case. > It involves iterating twice over the source string, which for long > strings could mean blowing the whole cache twice and fetching from > main memory twice. There's a good reason that string operations are > usually implemented to perform the copy and length computation > together in a single pass. > > Rich Few strings will be larger than the typical L1 size of 32KB. You're right that it is best to do a single pass in a highly optimized implementation. However the issue is that the C versions are so slow that even doing 2 passes will be significantly faster due to processing 8 bytes at a time - likely even if much larger than L1 (I'll check that). The goal of these patches is to ensure the C string routines are quite competitive out of the box, and benefit further when you add a few highly optimized routines (eg. strlen/strcpy). That means new targets are not forced to add optimized versions of all of the string routines in order to get decent performance (as unfortunately is the case today). Wilco
> Rich Felker wrote: > On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote: > > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote: > > >Yes, you're right, I timed it and there is actually little difference, while > > >the code is now even simpler. New version below (not attaching results in bad > > >characters due to various mail servers changing line endings). > > > > > >OK for commit? > > > > I think you could simplify it down to strnlen, memcpy, and memset. > > I don't think that's an improvement, at least not in the general case. > It involves iterating twice over the source string, which for long > strings could mean blowing the whole cache twice and fetching from > main memory twice. There's a good reason that string operations are > usually implemented to perform the copy and length computation > together in a single pass. I did a quick experiment with strcpy as it's simpler. Replacing it with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings of 16Mbytes! Perhaps more surprisingly, it has similar performance on these huge strings as an optimized strcpy. So the results are pretty clear, if you don't have a super optimized strcpy, then strlen+memcpy is the best way to do it. Wilco
On Thu, Sep 11, 2014 at 08:37:17PM +0100, Wilco Dijkstra wrote: > > Rich Felker wrote: > > On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote: > > > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote: > > > >Yes, you're right, I timed it and there is actually little difference, while > > > >the code is now even simpler. New version below (not attaching results in bad > > > >characters due to various mail servers changing line endings). > > > > > > > >OK for commit? > > > > > > I think you could simplify it down to strnlen, memcpy, and memset. > > > > I don't think that's an improvement, at least not in the general case. > > It involves iterating twice over the source string, which for long > > strings could mean blowing the whole cache twice and fetching from > > main memory twice. There's a good reason that string operations are > > usually implemented to perform the copy and length computation > > together in a single pass. > No that is what you think but never tested it. A problem is that in strcpy you spend most of time in call that are at most 256 bytes large. And if you spend 99% of time in some path and you slow down it by 2% to make case in with you spend 1% of time faster it is bad idea even if you could make that 1% for free. Give me name of linux package that regularly does strcpy of 32k+ strings, until then you should not focus on case that does not happen. Anyway we talk about strncpy here, anybody that uses it does not care about performance, its terrible by design, like write 64 byte string then do useless zeroing of remaining 4032 bytes speaks for itself. And as main usage is for fixed size buffers and these buffers are typically less than 32k its even more dubious to optimize for large sizes. > I did a quick experiment with strcpy as it's simpler. Replacing it > with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings > of 16Mbytes! Perhaps more surprisingly, it has similar performance on > these huge strings as an optimized strcpy. > What architecture? This could also happen because memcpy has special case to handle large strings that speeds this up. Its something that I tried in one-pass strcpy but it harms performance as overhead of checking size is bigger than benefit of larger size. > So the results are pretty clear, if you don't have a super optimized > strcpy, then strlen+memcpy is the best way to do it. > It is not that clear as you spend considerable amount of time on small lenghts, what is important is constant overhead of strcpy startup. However this needs platform specific tricks to decide which alternative is fastest.
> Ondřej Bílka wrote: > On Thu, Sep 11, 2014 at 08:37:17PM +0100, Wilco Dijkstra wrote: > > I did a quick experiment with strcpy as it's simpler. Replacing it > > with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings > > of 16Mbytes! Perhaps more surprisingly, it has similar performance on > > these huge strings as an optimized strcpy. > > > What architecture? This could also happen because memcpy has special > case to handle large strings that speeds this up. Its something that I > tried in one-pass strcpy but it harms performance as overhead of checking > size is bigger than benefit of larger size. The 3x happens on all 3 ISAs I tried. On ARM the memcpy/strlen variant even beats the optimized strcmp case for most sizes, on x64 it runs at about 80% of the optimized strcpy for sizes above 4KB. > > So the results are pretty clear, if you don't have a super optimized > > strcpy, then strlen+memcpy is the best way to do it. > > > It is not that clear as you spend considerable amount of time on small > lenghts, what is important is constant overhead of strcpy startup. > However this needs platform specific tricks to decide which alternative > is fastest. The overheads are relatively small on modern cores. The memcpy/strlen is always faster than the single loop for lengths larger than 8-16. Wilco
diff --git a/string/strncpy.c b/string/strncpy.c index 0915e03..d5fa5be 100644 --- a/string/strncpy.c +++ b/string/strncpy.c @@ -57,10 +57,10 @@ STRNCPY (char *s1, const char *s2, size_t n) if (--n4 == 0) goto last_chars; } - n = n - (s1 - s) - 1; - if (n == 0) - return s; - goto zero_fill; + s1++; + n = n - (s1 - s); + memset (s1, '\0', n); + return s; } last_chars: @@ -77,11 +77,7 @@ STRNCPY (char *s1, const char *s2, size_t n) } while (c != '\0'); - zero_fill: - do - *++s1 = '\0'; - while (--n > 0); - + memset (s1 + 1, '\0', n); return s; } libc_hidden_builtin_def (strncpy)