Message ID | 000101d0db53$e96233c0$bc269b40$@com
---|---
State | New |
On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> Enable _STRING_ARCH_unaligned on AArch64.
>
> 2015-08-20  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * sysdeps/aarch64/bits/string.h: New file.
>         (_STRING_ARCH_unaligned): Define.
>
> ---
>  sysdeps/aarch64/bits/string.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>  create mode 100644 sysdeps/aarch64/bits/string.h
>
> diff --git a/sysdeps/aarch64/bits/string.h b/sysdeps/aarch64/bits/string.h
> new file mode 100644
> index 0000000..5221e69
> --- /dev/null
> +++ b/sysdeps/aarch64/bits/string.h
> @@ -0,0 +1,24 @@
> +/* Optimized, inlined string functions.  AArch64 version.
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_H
> +# error "Never use <bits/string.h> directly; include <string.h> instead."
> +#endif
> +
> +/* AArch64 implementations support efficient unaligned access.  */
> +#define _STRING_ARCH_unaligned 1

I don't think this is 100% true.  On ThunderX, an unaligned store or
load takes an extra 8 cycles (a full pipeline flush) as all unaligned
load/stores have to be replayed.

I think we should also benchmark there to find out if this is a win,
because I doubt it is a win, but I could be proved wrong.

Thanks,
Andrew Pinski

> --
> 1.9.1
On Thu, Aug 20, 2015 at 11:43 PM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> Enable _STRING_ARCH_unaligned on AArch64.
>>
>> 2015-08-20  Wilco Dijkstra  <wdijkstr@arm.com>
>>
>>         * sysdeps/aarch64/bits/string.h: New file.
>>         (_STRING_ARCH_unaligned): Define.
>>
>> ---
>>  sysdeps/aarch64/bits/string.h | 24 ++++++++++++++++++++++++
>>  1 file changed, 24 insertions(+)
>>  create mode 100644 sysdeps/aarch64/bits/string.h
>>
>> diff --git a/sysdeps/aarch64/bits/string.h b/sysdeps/aarch64/bits/string.h
>> new file mode 100644
>> index 0000000..5221e69
>> --- /dev/null
>> +++ b/sysdeps/aarch64/bits/string.h
>> @@ -0,0 +1,24 @@
>> +/* Optimized, inlined string functions.  AArch64 version.
>> +   Copyright (C) 2015 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef _STRING_H
>> +# error "Never use <bits/string.h> directly; include <string.h> instead."
>> +#endif
>> +
>> +/* AArch64 implementations support efficient unaligned access.  */
>> +#define _STRING_ARCH_unaligned 1
>
> I don't think this is 100% true.  On ThunderX, an unaligned store or
> load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> load/stores have to be replayed.
>
> I think we should also benchmark there to find out if this is a win
> because I doubt it is a win but I could be proved wrong.

Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
so I can do the benchmarking on ThunderX?

Also I don't see any benchmark results even for any of the other
AARCH64 processors.

Thanks,
Andrew

>
> Thanks,
> Andrew Pinski
>
>> --
>> 1.9.1
> Andrew Pinski wrote:
> On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> > +
> > +/* AArch64 implementations support efficient unaligned access. */
> > +#define _STRING_ARCH_unaligned 1
>
> I don't think this is 100% true.  On ThunderX, an unaligned store or
> load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> load/stores have to be replayed.
>
> I think we should also benchmark there to find out if this is a win
> because I doubt it is a win but I could be proved wrong.

That's bad indeed, but it would still be better than doing everything
one byte at a time.  Eg. resolv/arpa/nameser.h does:

#define NS_GET32(l, cp) do { \
	const u_char *t_cp = (const u_char *)(cp); \
	(l) = ((u_int32_t)t_cp[0] << 24) \
	    | ((u_int32_t)t_cp[1] << 16) \
	    | ((u_int32_t)t_cp[2] << 8) \
	    | ((u_int32_t)t_cp[3]) \
	    ; \
	(cp) += NS_INT32SZ; \
} while (0)

This becomes an unaligned load plus byteswap with _STRING_ARCH_unaligned,
which should be faster even on ThunderX.

> Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
> so I can do the benchmarking on ThunderX?

I don't believe there are.

> Also I don't see any benchmark results even for any of the other
> AARCH64 processors.

It's obvious it is a win on most of the uses of _STRING_ARCH_unaligned.
Eg. consider the encryption code in crypt/md5.c:

#if !_STRING_ARCH_unaligned
  if (UNALIGNED_P (buffer))
    while (len > 64)
      {
	__md5_process_block (memcpy (ctx->buffer, buffer, 64), 64, ctx);
	buffer = (const char *) buffer + 64;
	len -= 64;
      }
  else
#endif

So basically you end up doing an extra memcpy if unaligned access is not
supported.  This means you'll not only do the unaligned loads anyway, but
you'll also do an extra aligned load and store to the buffer.

GLIBC use of _STRING_ARCH_unaligned is quite messy and would benefit from
a major cleanup, however it's quite clear enabling this is a win overall.

Wilco
On Thu, Aug 20, 2015 at 05:29:18PM +0100, Wilco Dijkstra wrote:
> > Andrew Pinski wrote:
> > On Thu, Aug 20, 2015 at 10:24 PM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> > > +
> > > +/* AArch64 implementations support efficient unaligned access. */
> > > +#define _STRING_ARCH_unaligned 1
> >
> > I don't think this is 100% true.  On ThunderX, an unaligned store or
> > load takes an extra 8 cycles (a full pipeline flush) as all unaligned
> > load/stores have to be replayed.
> > I think we should also benchmark there to find out if this is a win
> > because I doubt it is a win but I could be proved wrong.
>
> That's bad indeed, but it would still be better than doing everything
> one byte at a time.  Eg. resolv/arpa/nameser.h does:
>

There are two things: one is whether unaligned loads are possible, the
second is whether they are fast.  If they are not, you will end up
emulating them with aligned loads and shifts.  Even then you should
still use an unaligned load in the header of a function, as latency
matters and the chain of instructions needed to assemble an unaligned
vector has high latency.

> > Are there benchmarks for each of the uses of _STRING_ARCH_unaligned
> > so I can do the benchmarking on ThunderX?
>
> I don't believe there are.
>

There is also the matter that future optimizations could rely on this,
so it is better to stay consistent.

> > Also I don't see any benchmark results even for any of the other
> > AARCH64 processors.
>
> It's obvious it is a win on most of the uses of _STRING_ARCH_unaligned.
> Eg. consider the encryption code in crypt/md5.c:
>
> #if !_STRING_ARCH_unaligned
>   if (UNALIGNED_P (buffer))
>     while (len > 64)
>       {
> 	__md5_process_block (memcpy (ctx->buffer, buffer, 64), 64, ctx);
> 	buffer = (const char *) buffer + 64;
> 	len -= 64;
>       }
>   else
> #endif
>
> So basically you end up doing an extra memcpy if unaligned access is not
> supported.  This means you'll not only do the unaligned loads anyway, but
> you'll also do an extra aligned load and store to the buffer.
> GLIBC use of _STRING_ARCH_unaligned is quite messy and would benefit from
> a major cleanup, however it's quite clear enabling this is a win overall.
>

Correct, I see several other optimization opportunities that would benefit
from it.
On 20 August 2015 at 15:24, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> Enable _STRING_ARCH_unaligned on AArch64.
>
> 2015-08-20  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * sysdeps/aarch64/bits/string.h: New file.
>         (_STRING_ARCH_unaligned): Define.

OK
/Marcus
diff --git a/sysdeps/aarch64/bits/string.h b/sysdeps/aarch64/bits/string.h
new file mode 100644
index 0000000..5221e69
--- /dev/null
+++ b/sysdeps/aarch64/bits/string.h
@@ -0,0 +1,24 @@
+/* Optimized, inlined string functions.  AArch64 version.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_H
+# error "Never use <bits/string.h> directly; include <string.h> instead."
+#endif
+
+/* AArch64 implementations support efficient unaligned access.  */
+#define _STRING_ARCH_unaligned 1