Message ID | 000101d0db53$e96233c0$bc269b40$@com
---|---
State | New |
On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> Enable _STRING_ARCH_unaligned on AArch64.
>
> 2015-08-20  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * sysdeps/aarch64/bits/string.h: New file.
>         (_STRING_ARCH_unaligned): Define.
>
> ---
>  sysdeps/aarch64/bits/string.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>  create mode 100644 sysdeps/aarch64/bits/string.h
>
> diff --git a/sysdeps/aarch64/bits/string.h b/sysdeps/aarch64/bits/string.h
> new file mode 100644
> index 0000000..5221e69
> --- /dev/null
> +++ b/sysdeps/aarch64/bits/string.h
> @@ -0,0 +1,24 @@
> +/* Optimized, inlined string functions.  AArch64 version.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_H
> +# error "Never use <bits/string.h> directly; include <string.h> instead."
> +#endif
> +
> +/* AArch64 implementations support efficient unaligned access.  */
> +#define _STRING_ARCH_unaligned 1

I don't think this is 100% true.  On ThunderX, an unaligned store or
load takes an extra 8 cycles (a full pipeline flush) as all unaligned
load/stores have to be replayed.

I think we should also benchmark there to find out if this is a win,
because I doubt it is a win, but I could be proved wrong.

Thanks,
Andrew Pinski

> --
> 1.9.1
On Thu, Aug 20, 2015 at 11:43 PM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> Enable _STRING_ARCH_unaligned on AArch64.
>>
>> 2015-08-20  Wilco Dijkstra  <wdijkstr@arm.com>
>>
>>         * sysdeps/aarch64/bits/string.h: New file.
>>         (_STRING_ARCH_unaligned): Define.
>>
>> ---
>>  sysdeps/aarch64/bits/string.h | 24 ++++++++++++++++++++++++
>>  1 file changed, 24 insertions(+)
>>  create mode 100644 sysdeps/aarch64/bits/string.h
>>
>> diff --git a/sysdeps/aarch64/bits/string.h b/sysdeps/aarch64/bits/string.h
>> new file mode 100644
>> index 0000000..5221e69
>> --- /dev/null
>> +++ b/sysdeps/aarch64/bits/string.h
>> @@ -0,0 +1,24 @@
>> +/* Optimized, inlined string functions.  AArch64 version.
>> +   Copyright (C) 2015 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef _STRING_H
>> +# error "Never use <bits/string.h> directly; include <string.h> instead."
>> +#endif
>> +
>> +/* AArch64 implementations support efficient unaligned access.  */
>> +#define _STRING_ARCH_unaligned 1
>
> I don't think this is 100% true.  On ThunderX, an unaligned store or
> load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> load/stores have to be replayed.
>
> I think we should also benchmark there to find out if this is a win
> because I doubt it is a win but I could be proved wrong.

Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
so I can do the benchmarking on ThunderX?

Also I don't see any benchmark results even for any of the other
AARCH64 processors.

Thanks,
Andrew

>
> Thanks,
> Andrew Pinski
>
>> --
>> 1.9.1
> Andrew Pinski wrote:
> On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> > +
> > +/* AArch64 implementations support efficient unaligned access. */
> > +#define _STRING_ARCH_unaligned 1
>
> I don't think this is 100% true.  On ThunderX, an unaligned store or
> load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> load/stores have to be replayed.
>
> I think we should also benchmark there to find out if this is a win
> because I doubt it is a win but I could be proved wrong.

That's bad indeed, but it would still be better than doing everything
one byte at a time.  Eg. resolv/arpa/nameser.h does:

#define NS_GET32(l, cp) do { \
	const u_char *t_cp = (const u_char *)(cp); \
	(l) = ((u_int32_t)t_cp[0] << 24) \
	    | ((u_int32_t)t_cp[1] << 16) \
	    | ((u_int32_t)t_cp[2] << 8) \
	    | ((u_int32_t)t_cp[3]) \
	    ; \
	(cp) += NS_INT32SZ; \
} while (0)

This becomes an unaligned load plus byteswap with _STRING_ARCH_unaligned,
which should be faster even on ThunderX.

> Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
> so I can do the benchmarking on ThunderX?

I don't believe there are.

> Also I don't see any benchmark results even for any of the other
> AARCH64 processors.

It's obvious it is a win on most of the uses of _STRING_ARCH_unaligned.
Eg. consider the encryption code in crypt/md5.c:

#if !_STRING_ARCH_unaligned
  if (UNALIGNED_P (buffer))
    while (len > 64)
      {
	__md5_process_block (memcpy (ctx->buffer, buffer, 64), 64, ctx);
	buffer = (const char *) buffer + 64;
	len -= 64;
      }
  else
#endif

So basically you end up doing an extra memcpy if unaligned access is not
supported.  This means you'll not only do the unaligned loads anyway, but
you'll also do an extra aligned load and store to the buffer.

GLIBC use of _STRING_ARCH_unaligned is quite messy and would benefit from
a major cleanup, however it's quite clear enabling this is a win overall.

Wilco
On Thu, Aug 20, 2015 at 05:29:18PM +0100, Wilco Dijkstra wrote:
> > Andrew Pinski wrote:
> > On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> > > +
> > > +/* AArch64 implementations support efficient unaligned access. */
> > > +#define _STRING_ARCH_unaligned 1
> >
> > I don't think this is 100% true.  On ThunderX, an unaligned store or
> > load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> > load/stores have to be replayed.
> > I think we should also benchmark there to find out if this is a win
> > because I doubt it is a win but I could be proved wrong.
>
> That's bad indeed, but it would still be better than doing everything
> one byte at a time.  Eg. resolv/arpa/nameser.h does:
>

There are two things: one is whether unaligned loads are possible, the
second is whether they are fast.  If they are not, you will end up
emulating them with aligned loads and shifts.  Even then you should
still use an unaligned load in the header of a function, as latency
matters and the chain of instructions needed to assemble an unaligned
vector has high latency.

> > Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
> > so I can do the benchmarking on ThunderX?
>
> I don't believe there are.
>

There is also the matter that future optimizations could rely on this,
so it is better to stay consistent.

> > Also I don't see any benchmark results even for any of the other
> > AARCH64 processors.
>
> It's obvious it is a win on most of the uses of _STRING_ARCH_unaligned.
> Eg. consider the encryption code in crypt/md5.c:
>
> #if !_STRING_ARCH_unaligned
>   if (UNALIGNED_P (buffer))
>     while (len > 64)
>       {
> 	__md5_process_block (memcpy (ctx->buffer, buffer, 64), 64, ctx);
> 	buffer = (const char *) buffer + 64;
> 	len -= 64;
>       }
>   else
> #endif
>
> So basically you end up doing an extra memcpy if unaligned access is not
> supported.  This means you'll not only do the unaligned loads anyway, but
> you'll also do an extra aligned load and store to the buffer.
> GLIBC use of _STRING_ARCH_unaligned is quite messy and would benefit from
> a major cleanup, however it's quite clear enabling this is a win overall.
>

Correct, I see several other optimization opportunities that would benefit
from it.
On 20 August 2015 at 15:24, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> Enable _STRING_ARCH_unaligned on AArch64.
>
> 2015-08-20  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * sysdeps/aarch64/bits/string.h: New file.
>         (_STRING_ARCH_unaligned): Define.

OK
/Marcus
diff --git a/sysdeps/aarch64/bits/string.h b/sysdeps/aarch64/bits/string.h
new file mode 100644
index 0000000..5221e69
--- /dev/null
+++ b/sysdeps/aarch64/bits/string.h
@@ -0,0 +1,24 @@
+/* Optimized, inlined string functions.  AArch64 version.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_H
+# error "Never use <bits/string.h> directly; include <string.h> instead."
+#endif
+
+/* AArch64 implementations support efficient unaligned access.  */
+#define _STRING_ARCH_unaligned 1