Message ID | Zv2M3rKroQ68Xoou@gate |
---|---|
State | New |
Headers | show |
Series | libstdc++: Unroll loop in load_bytes function | expand |
On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > Instead of looping over every byte of the tail, unroll loop manually > using switch statement, then compilers (at least GCC and Clang) will > generate a jump table [1], which is faster on a microbenchmark [2]. > > [1]: https://godbolt.org/z/aE8Mq3j5G > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > libstdc++-v3/ChangeLog: > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > loop using switch statement. > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > --- > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > 1 file changed, 23 insertions(+), 4 deletions(-) > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > index 3665375096a..294a7323dd0 100644 > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > @@ -50,10 +50,29 @@ namespace > load_bytes(const char* p, int n) > { > std::size_t result = 0; > - --n; > - do > - result = (result << 8) + static_cast<unsigned char>(p[n]); > - while (--n >= 0); Don't we still need to loop, for the case where n >= 8? Otherwise we only hash the first 8 bytes. > + switch(n & 7) > + { > + case 7: > + result |= std::size_t(p[6]) << 48; > + [[gnu::fallthrough]]; > + case 6: > + result |= std::size_t(p[5]) << 40; > + [[gnu::fallthrough]]; > + case 5: > + result |= std::size_t(p[4]) << 32; > + [[gnu::fallthrough]]; > + case 4: > + result |= std::size_t(p[3]) << 24; > + [[gnu::fallthrough]]; > + case 3: > + result |= std::size_t(p[2]) << 16; > + [[gnu::fallthrough]]; > + case 2: > + result |= std::size_t(p[1]) << 8; > + [[gnu::fallthrough]]; > + case 1: > + result |= std::size_t(p[0]); > + }; > return result; > } > > -- > 2.43.5 >
On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > Instead of looping over every byte of the tail, unroll loop manually > > using switch statement, then compilers (at least GCC and Clang) will > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > libstdc++-v3/ChangeLog: > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > loop using switch statement. > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > --- > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > index 3665375096a..294a7323dd0 100644 > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > @@ -50,10 +50,29 @@ namespace > > load_bytes(const char* p, int n) > > { > > std::size_t result = 0; > > - --n; > > - do > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > - while (--n >= 0); > > Don't we still need to loop, for the case where n >= 8? Otherwise we > only hash the first 8 bytes. Ah, but it's only ever called with load_bytes(end, len & 0x7) > > > + switch(n & 7) > > + { > > + case 7: > > + result |= std::size_t(p[6]) << 48; > > + [[gnu::fallthrough]]; > > + case 6: > > + result |= std::size_t(p[5]) << 40; > > + [[gnu::fallthrough]]; > > + case 5: > > + result |= std::size_t(p[4]) << 32; > > + [[gnu::fallthrough]]; > > + case 4: > > + result |= std::size_t(p[3]) << 24; > > + [[gnu::fallthrough]]; > > + case 3: > > + result |= std::size_t(p[2]) << 16; > > + [[gnu::fallthrough]]; > > + case 2: > > + result |= std::size_t(p[1]) << 8; > > + [[gnu::fallthrough]]; > > + case 1: > > + result |= std::size_t(p[0]); > > + }; > > return result; > > } > > > > -- > > 2.43.5 > >
On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwakely@redhat.com> wrote: > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > using switch statement, then compilers (at least GCC and Clang) will > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > libstdc++-v3/ChangeLog: > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > loop using switch statement. > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > --- > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > index 3665375096a..294a7323dd0 100644 > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > @@ -50,10 +50,29 @@ namespace > > > load_bytes(const char* p, int n) > > > { > > > std::size_t result = 0; > > > - --n; > > > - do > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > - while (--n >= 0); > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > only hash the first 8 bytes. > > Ah, but it's only ever called with load_bytes(end, len & 0x7) It seems to be slower for short strings, but a win overall: https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8 This measures different lengths, and tries to ensure that the string contents aren't treated as constant. > > > > > > > + switch(n & 7) > > > + { > > > + case 7: > > > + result |= std::size_t(p[6]) << 48; > > > + [[gnu::fallthrough]]; > > > + case 6: > > > + result |= std::size_t(p[5]) << 40; > > > + [[gnu::fallthrough]]; > > > + case 5: > > > + result |= std::size_t(p[4]) << 32; > > > + [[gnu::fallthrough]]; > > > + case 4: > > > + result |= std::size_t(p[3]) << 24; > > > + [[gnu::fallthrough]]; > > > + case 3: > > > + result |= std::size_t(p[2]) << 16; > > > + [[gnu::fallthrough]]; > > > + case 2: > > > + result |= std::size_t(p[1]) << 8; > > > + [[gnu::fallthrough]]; > > > + case 1: > > > + result |= std::size_t(p[0]); > > > + }; > > > return result; > > > } > > > > > > -- > > > 2.43.5 > > >
On Wed, Oct 02, 2024 at 08:15:38PM +0100, Jonathan Wakely wrote: > On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > loop using switch statement. > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > > --- > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > index 3665375096a..294a7323dd0 100644 > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > @@ -50,10 +50,29 @@ namespace > > > > load_bytes(const char* p, int n) > > > > { > > > > std::size_t result = 0; > > > > - --n; > > > > - do > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > - while (--n >= 0); > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > only hash the first 8 bytes. > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > It seems to be slower for short strings, but a win overall: > https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8 > This measures different lengths, and tries to ensure that the string > contents aren't treated as constant. Nice find, thanks for spotting it. In retrospect, I think case n = 7 is best for unrolled switch version and worst for loop version, because it maximize overhead from loop control flow instructions. In contrast, case n = 1 is completely opposite in that sense, we have minimal overhead from loop control flow. If explanation above is correct, then not sure we can do something better for small cases. Seems like this lunch is not completely free. > > > > > > > > > > > > + switch(n & 7) > > > > + { > > > > + case 7: > > > > + result |= std::size_t(p[6]) << 48; > > > > + [[gnu::fallthrough]]; > > > > + case 6: > > > > + result |= std::size_t(p[5]) << 40; > > > > + [[gnu::fallthrough]]; > > > > + case 5: > > > > + result |= std::size_t(p[4]) << 32; > > > > + [[gnu::fallthrough]]; > > > > + case 4: > > > > + result |= std::size_t(p[3]) << 24; > > > > + [[gnu::fallthrough]]; > > > > + case 3: > > > > + result |= std::size_t(p[2]) << 16; > > > > + [[gnu::fallthrough]]; > > > > + case 2: > > > > + result |= std::size_t(p[1]) << 8; > > > > + [[gnu::fallthrough]]; > > > > + case 1: > > > > + result |= std::size_t(p[0]); > > > > + }; > > > > return result; > > > > } > > > > > > > > -- > > > > 2.43.5 > > > > > Forgot to CC mailing lists. Sorry about that.
On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > using switch statement, then compilers (at least GCC and Clang) will > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > libstdc++-v3/ChangeLog: > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > loop using switch statement. > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > --- > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > index 3665375096a..294a7323dd0 100644 > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > @@ -50,10 +50,29 @@ namespace > > > load_bytes(const char* p, int n) > > > { > > > std::size_t result = 0; > > > - --n; > > > - do > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > - while (--n >= 0); > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > only hash the first 8 bytes. > > Ah, but it's only ever called with load_bytes(end, len & 0x7) The compiler should do such transforms - you probably want to tell it that n < 8 though, it likely doesn't (always) know. > > > > > > > + switch(n & 7) > > > + { > > > + case 7: > > > + result |= std::size_t(p[6]) << 48; > > > + [[gnu::fallthrough]]; > > > + case 6: > > > + result |= std::size_t(p[5]) << 40; > > > + [[gnu::fallthrough]]; > > > + case 5: > > > + result |= std::size_t(p[4]) << 32; > > > + [[gnu::fallthrough]]; > > > + case 4: > > > + result |= std::size_t(p[3]) << 24; > > > + [[gnu::fallthrough]]; > > > + case 3: > > > + result |= std::size_t(p[2]) << 16; > > > + [[gnu::fallthrough]]; > > > + case 2: > > > + result |= std::size_t(p[1]) << 8; > > > + [[gnu::fallthrough]]; > > > + case 1: > > > + result |= std::size_t(p[0]); > > > + }; > > > return result; > > > } > > > > > > -- > > > 2.43.5 > > > >
On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote: > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > loop using switch statement. > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > > --- > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > index 3665375096a..294a7323dd0 100644 > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > @@ -50,10 +50,29 @@ namespace > > > > load_bytes(const char* p, int n) > > > > { > > > > std::size_t result = 0; > > > > - --n; > > > > - do > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > - while (--n >= 0); > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > only hash the first 8 bytes. > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > The compiler should do such transforms - you probably want to tell > it that n < 8 though, it likely doesn't (always) know. e.g. like this? if ((n & 7) != n) __builtin_unreachable(); For the microbenchmark that seems to make things consistently worse: https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw
On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote: > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote: > > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > > loop using switch statement. > > > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > > > --- > > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > index 3665375096a..294a7323dd0 100644 > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > @@ -50,10 +50,29 @@ namespace > > > > > load_bytes(const char* p, int n) > > > > > { > > > > > std::size_t result = 0; > > > > > - --n; > > > > > - do > > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > > - while (--n >= 0); > > > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > > only hash the first 8 bytes. > > > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > > > The compiler should do such transforms - you probably want to tell > > it that n < 8 though, it likely doesn't (always) know. > > e.g. like this? > > if ((n & 7) != n) > __builtin_unreachable(); > > For the microbenchmark that seems to make things consistently worse: > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <= n is always true too.
On Fri, Oct 4, 2024 at 11:20 AM Jonathan Wakely <jwakely@redhat.com> wrote: > > On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote: > > > > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > > > loop using switch statement. > > > > > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > > > > --- > > > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > index 3665375096a..294a7323dd0 100644 > > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > @@ -50,10 +50,29 @@ namespace > > > > > > load_bytes(const char* p, int n) > > > > > > { > > > > > > std::size_t result = 0; > > > > > > - --n; > > > > > > - do > > > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > > > - while (--n >= 0); > > > > > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > > > only hash the first 8 bytes. > > > > > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > > > > > The compiler should do such transforms - you probably want to tell > > > it that n < 8 though, it likely doesn't (always) know. > > > > e.g. like this? > > > > if ((n & 7) != n) > > __builtin_unreachable(); > > > > For the microbenchmark that seems to make things consistently worse: > > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw > > Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <= > n is always true too. Yes, expose the tightest range to the compiler. Of course this might trigger opportunistic optimiztation like complete peeling. Richard.
On Fri, Oct 04, 2024 at 10:20:27AM +0100, Jonathan Wakely wrote: > On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote: > > > > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > > > loop using switch statement. > > > > > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > > > > --- > > > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > index 3665375096a..294a7323dd0 100644 > > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > @@ -50,10 +50,29 @@ namespace > > > > > > load_bytes(const char* p, int n) > > > > > > { > > > > > > std::size_t result = 0; > > > > > > - --n; > > > > > > - do > > > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > > > - while (--n >= 0); > > > > > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > > > only hash the first 8 bytes. > > > > > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > > > > > The compiler should do such transforms - you probably want to tell > > > it that n < 8 though, it likely doesn't (always) know. > > > > e.g. like this? > > > > if ((n & 7) != n) > > __builtin_unreachable(); > > > > For the microbenchmark that seems to make things consistently worse: > > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw > > Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <= > n is always true too. > GCC still wasn't able to unroll the loop, even with a __builtin_unreachable, but benchmark link you mentioned above uses -O2 optimization level (not sure if it was intentional). If we'll use -O3 [1], then GCC was able to unroll the loop for load_bytes_loop_assume version, but at the same time I am not sure all loop control instructions were elided, I still can see them on Godbolt version of generated code [2]. Benchmark charts partially confirm that, because performance of load_bytes_loop and load_bytes_loop_assume are now quite close (same actually, except case n = 1). I guess it would make sense, as we execute same amount of instructions. In addition, chart for load_bytes_switch look quite jumpy for [1] and became better for cases n = 1 and n = 2. At this point I am not sure it is not a code alignment issue and we are not measuring noise. [1]: https://quick-bench.com/q/LlcgMVhL61CasZVjCWbHd3uid8w [2]: https://godbolt.org/z/qPf1n7xWs
On Fri, 4 Oct 2024 at 13:53, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > On Fri, Oct 04, 2024 at 10:20:27AM +0100, Jonathan Wakely wrote: > > On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote: > > > > > > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > > > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote: > > > > > > > > > > > > > > Instead of looping over every byte of the tail, unroll loop manually > > > > > > > using switch statement, then compilers (at least GCC and Clang) will > > > > > > > generate a jump table [1], which is faster on a microbenchmark [2]. > > > > > > > > > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G > > > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 > > > > > > > > > > > > > > libstdc++-v3/ChangeLog: > > > > > > > > > > > > > > * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll > > > > > > > loop using switch statement. > > > > > > > > > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> > > > > > > > --- > > > > > > > libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- > > > > > > > 1 file changed, 23 insertions(+), 4 deletions(-) > > > > > > > > > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > index 3665375096a..294a7323dd0 100644 > > > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc > > > > > > > @@ -50,10 +50,29 @@ namespace > > > > > > > load_bytes(const char* p, int n) > > > > > > > { > > > > > > > std::size_t result = 0; > > > > > > > - --n; > > > > > > > - do > > > > > > > - result = (result << 8) + static_cast<unsigned char>(p[n]); > > > > > > > - while (--n >= 0); > > > > > > > > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we > > > > > > only hash the first 8 bytes. > > > > > > > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7) > > > > > > > > The compiler should do such transforms - you probably want to tell > > > > it that n < 8 though, it likely doesn't (always) know. > > > > > > e.g. like this? > > > > > > if ((n & 7) != n) > > > __builtin_unreachable(); > > > > > > For the microbenchmark that seems to make things consistently worse: > > > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw > > > > Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <= > > n is always true too. > > > > GCC still wasn't able to unroll the loop, even with a > __builtin_unreachable, but benchmark link you mentioned above uses -O2 > optimization level (not sure if it was intentional). That was intentional, because that's how libsupc++/hash_bytes.cc gets compiled. > > If we'll use -O3 [1], then GCC was able to unroll the loop for > load_bytes_loop_assume version, but at the same time I am not sure all > loop control instructions were elided, I still can see them on Godbolt > version of generated code [2]. Benchmark charts partially confirm that, > because performance of load_bytes_loop and load_bytes_loop_assume are > now quite close (same actually, except case n = 1). I guess it would > make sense, as we execute same amount of instructions. > > In addition, chart for load_bytes_switch look quite jumpy for [1] and > became better for cases n = 1 and n = 2. At this point I am not sure it > is not a code alignment issue and we are not measuring noise. > > [1]: https://quick-bench.com/q/LlcgMVhL61CasZVjCWbHd3uid8w > [2]: https://godbolt.org/z/qPf1n7xWs >
diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc index 3665375096a..294a7323dd0 100644 --- a/libstdc++-v3/libsupc++/hash_bytes.cc +++ b/libstdc++-v3/libsupc++/hash_bytes.cc @@ -50,10 +50,29 @@ namespace load_bytes(const char* p, int n) { std::size_t result = 0; - --n; - do - result = (result << 8) + static_cast<unsigned char>(p[n]); - while (--n >= 0); + switch(n & 7) + { + case 7: + result |= std::size_t(p[6]) << 48; + [[gnu::fallthrough]]; + case 6: + result |= std::size_t(p[5]) << 40; + [[gnu::fallthrough]]; + case 5: + result |= std::size_t(p[4]) << 32; + [[gnu::fallthrough]]; + case 4: + result |= std::size_t(p[3]) << 24; + [[gnu::fallthrough]]; + case 3: + result |= std::size_t(p[2]) << 16; + [[gnu::fallthrough]]; + case 2: + result |= std::size_t(p[1]) << 8; + [[gnu::fallthrough]]; + case 1: + result |= std::size_t(p[0]); + }; return result; }
Instead of looping over every byte of the tail, unroll loop manually using switch statement, then compilers (at least GCC and Clang) will generate a jump table [1], which is faster on a microbenchmark [2]. [1]: https://godbolt.org/z/aE8Mq3j5G [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24 libstdc++-v3/ChangeLog: * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll loop using switch statement. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> --- libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-)