Improve performance of strncpy

Message ID	001301cfcd0a$f0b62670$d2227350$@com
State	New
Headers	show Return-Path: <libc-alpha-return-52681-incoming=patchwork.ozlabs.org@sourceware.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:references:in-reply-to:subject:date :message-id:mime-version:content-type:content-transfer-encoding; q=dns; s=default; b=b3BxHdlTLwMguOAan6RHlVM8OmOH51v78Olu4JrJpSm k5ljZVBptanfFre88PcG+AQD8kq2BuFrwX+WaAhMw8aOGPQb1oJKRLkP723JDelE TZ9f0fxZDkp8UZmykZvyDJkCl1A4CNOdh8ngJairaFOJ6JrcreBFQf/W46/fSx80 = Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org From: "Wilco Dijkstra" <wdijkstr@arm.com> To: <azanella@linux.vnet.ibm.com> Cc: <libc-alpha@sourceware.org> References: In-Reply-To: Subject: RE: [PATCH] Improve performance of strncpy Date: Wed, 10 Sep 2014 16:21:48 +0100 Message-ID: <001301cfcd0a$f0b62670$d2227350$@com> MIME-Version: 1.0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable

Wilco Sept. 10, 2014, 3:21 p.m. UTC

Adhemerval Zanella wrote:
> Hi, the patch looks ok. I also pushed a similar modification for powerpc based on same idea. 

>   zero_fill:
> -  do
> -    *++s1 = '\0';
> -  while (--n > 0);
> +  if (n >= 8)
> +    memset (s1 + 1, '\0', n);
> +  else
> +    do
> +      *++s1 = '\0';
> +    while (--n > 0);

> I wonder if this test is really worth, my opinion is just to keep it simple
> and just call memset on both 'goto' in loop and after 'last_chars'.

Yes, you're right, I timed it and there is actually little difference, while
the code is now even simpler. New version below (not attaching results in bad
characters due to various mail servers changing line endings).

OK for commit?

---
 string/strncpy.c |   14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

Florian Weimer Sept. 10, 2014, 5:34 p.m. UTC | #1

On 09/10/2014 05:21 PM, Wilco Dijkstra wrote:
> Yes, you're right, I timed it and there is actually little difference, while
> the code is now even simpler. New version below (not attaching results in bad
> characters due to various mail servers changing line endings).
>
> OK for commit?

I think you could simplify it down to strnlen, memcpy, and memset.

Rich Felker Sept. 10, 2014, 6:01 p.m. UTC | #2

On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote:
> On 09/10/2014 05:21 PM, Wilco Dijkstra wrote:
> >Yes, you're right, I timed it and there is actually little difference, while
> >the code is now even simpler. New version below (not attaching results in bad
> >characters due to various mail servers changing line endings).
> >
> >OK for commit?
> 
> I think you could simplify it down to strnlen, memcpy, and memset.

I don't think that's an improvement, at least not in the general case.
It involves iterating twice over the source string, which for long
strings could mean blowing the whole cache twice and fetching from
main memory twice. There's a good reason that string operations are
usually implemented to perform the copy and length computation
together in a single pass.

Rich

Wilco Sept. 10, 2014, 6:09 p.m. UTC | #3

> Florian Weimer wrote:
> On 09/10/2014 05:21 PM, Wilco Dijkstra wrote:
> > Yes, you're right, I timed it and there is actually little difference, while
> > the code is now even simpler. New version below (not attaching results in bad
> > characters due to various mail servers changing line endings).
> >
> > OK for commit?
> 
> I think you could simplify it down to strnlen, memcpy, and memset.

I'll have a go at that in a follow-on patch. Something similar is possible with
strcpy and strncat which currently do byte-by-byte copies as well. Simplifying 
the code is certainly a nice benefit besides the performance gain on longer
strings.

Wilco

Wilco Sept. 10, 2014, 6:25 p.m. UTC | #4

> Rich Felker wrote:
> On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote:
> > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote:
> > >Yes, you're right, I timed it and there is actually little difference, while
> > >the code is now even simpler. New version below (not attaching results in bad
> > >characters due to various mail servers changing line endings).
> > >
> > >OK for commit?
> >
> > I think you could simplify it down to strnlen, memcpy, and memset.
> 
> I don't think that's an improvement, at least not in the general case.
> It involves iterating twice over the source string, which for long
> strings could mean blowing the whole cache twice and fetching from
> main memory twice. There's a good reason that string operations are
> usually implemented to perform the copy and length computation
> together in a single pass.
> 
> Rich

Few strings will be larger than the typical L1 size of 32KB. You're right
that it is best to do a single pass in a highly optimized implementation.
However the issue is that the C versions are so slow that even doing 2
passes will be significantly faster due to processing 8 bytes at a time - 
likely even if much larger than L1 (I'll check that).

The goal of these patches is to ensure the C string routines are quite
competitive out of the box, and benefit further when you add a few highly
optimized routines (eg. strlen/strcpy). That means new targets are not
forced to add optimized versions of all of the string routines in order to
get decent performance (as unfortunately is the case today).

Wilco

Wilco Sept. 11, 2014, 7:37 p.m. UTC | #5

> Rich Felker wrote:
> On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote:
> > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote:
> > >Yes, you're right, I timed it and there is actually little difference, while
> > >the code is now even simpler. New version below (not attaching results in bad
> > >characters due to various mail servers changing line endings).
> > >
> > >OK for commit?
> >
> > I think you could simplify it down to strnlen, memcpy, and memset.
> 
> I don't think that's an improvement, at least not in the general case.
> It involves iterating twice over the source string, which for long
> strings could mean blowing the whole cache twice and fetching from
> main memory twice. There's a good reason that string operations are
> usually implemented to perform the copy and length computation
> together in a single pass.

I did a quick experiment with strcpy as it's simpler. Replacing it 
with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings 
of 16Mbytes! Perhaps more surprisingly, it has similar performance on 
these huge strings as an optimized strcpy.

So the results are pretty clear, if you don't have a super optimized
strcpy, then strlen+memcpy is the best way to do it.

Wilco

Ondřej Bílka Sept. 12, 2014, 6:22 a.m. UTC | #6

On Thu, Sep 11, 2014 at 08:37:17PM +0100, Wilco Dijkstra wrote:
> > Rich Felker wrote:
> > On Wed, Sep 10, 2014 at 07:34:40PM +0200, Florian Weimer wrote:
> > > On 09/10/2014 05:21 PM, Wilco Dijkstra wrote:
> > > >Yes, you're right, I timed it and there is actually little difference, while
> > > >the code is now even simpler. New version below (not attaching results in bad
> > > >characters due to various mail servers changing line endings).
> > > >
> > > >OK for commit?
> > >
> > > I think you could simplify it down to strnlen, memcpy, and memset.
> > 
> > I don't think that's an improvement, at least not in the general case.
> > It involves iterating twice over the source string, which for long
> > strings could mean blowing the whole cache twice and fetching from
> > main memory twice. There's a good reason that string operations are
> > usually implemented to perform the copy and length computation
> > together in a single pass.
>
No that is what you think but never tested it. A problem is that in
strcpy you spend most of time in call that are at most 256 bytes large. 
And if you spend 99% of time in some path and you slow down it by 2% to
make case in with you spend 1% of time faster it is bad idea even if you
could make that 1% for free.

Give me name of linux package that regularly does strcpy of 32k+ strings, 
until then you should not focus on case that does not happen.

Anyway we talk about strncpy here, anybody that uses it does not care
about performance, its terrible by design, like write 64 byte string then do
useless zeroing of remaining 4032 bytes speaks for itself.

And as main usage is for fixed size buffers and these buffers are
typically less than 32k its even more dubious to optimize for large
sizes.

> I did a quick experiment with strcpy as it's simpler. Replacing it 
> with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings 
> of 16Mbytes! Perhaps more surprisingly, it has similar performance on 
> these huge strings as an optimized strcpy.
>
What architecture? This could also happen because memcpy has special 
case to handle large strings that speeds this up. Its something that I
tried in one-pass strcpy but it harms performance as overhead of checking
size is bigger than benefit of larger size.
 
> So the results are pretty clear, if you don't have a super optimized
> strcpy, then strlen+memcpy is the best way to do it.
> 
It is not that clear as you spend considerable amount of time on small
lenghts, what is important is constant overhead of strcpy startup.
However this needs platform specific tricks to decide which alternative
is fastest.

Wilco Sept. 12, 2014, 11:04 a.m. UTC | #7

> Ondřej Bílka wrote:
> On Thu, Sep 11, 2014 at 08:37:17PM +0100, Wilco Dijkstra wrote:
> > I did a quick experiment with strcpy as it's simpler. Replacing it
> > with memcpy (d, s, strlen (s) + 1) is 3 times faster even on strings
> > of 16Mbytes! Perhaps more surprisingly, it has similar performance on
> > these huge strings as an optimized strcpy.
> >
> What architecture? This could also happen because memcpy has special
> case to handle large strings that speeds this up. Its something that I
> tried in one-pass strcpy but it harms performance as overhead of checking
> size is bigger than benefit of larger size.

The 3x happens on all 3 ISAs I tried. On ARM the memcpy/strlen variant
even beats the optimized strcmp case for most sizes, on x64 it runs at
about 80% of the optimized strcpy for sizes above 4KB.

> > So the results are pretty clear, if you don't have a super optimized
> > strcpy, then strlen+memcpy is the best way to do it.
> >
> It is not that clear as you spend considerable amount of time on small
> lenghts, what is important is constant overhead of strcpy startup.
> However this needs platform specific tricks to decide which alternative
> is fastest.

The overheads are relatively small on modern cores. The memcpy/strlen
is always faster than the single loop for lengths larger than 8-16.

Wilco

Improve performance of strncpy

Commit Message

Comments

Patch