libstdc++: Unroll loop in load_bytes function

Message ID	Zv2M3rKroQ68Xoou@gate
State	New
Headers	show Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 4C96F3858D29 sender: d@ilvokhin.com) by mail.ilvokhin.com (Postfix) with ESMTPSA id 20CC43D68F; Wed, 02 Oct 2024 18:11:48 +0000 (UTC) Date: Wed, 2 Oct 2024 18:11:42 +0000 From: Dmitry Ilvokhin <d@ilvokhin.com> To: libstdc++@gcc.gnu.org Cc: gcc-patches@gcc.gnu.org Subject: [PATCH] libstdc++: Unroll loop in load_bytes function Message-ID: <Zv2M3rKroQ68Xoou@gate> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Precedence: list Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Series	libstdc++: Unroll loop in load_bytes function \| expand libstdc++: Unroll loop in load_bytes function

Dmitry Ilvokhin Oct. 2, 2024, 6:11 p.m. UTC

Instead of looping over every byte of the tail, unroll loop manually
using switch statement, then compilers (at least GCC and Clang) will
generate a jump table [1], which is faster on a microbenchmark [2].

[1]: https://godbolt.org/z/aE8Mq3j5G
[2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24

libstdc++-v3/ChangeLog:

        * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
          loop using switch statement.

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
 libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

Jonathan Wakely Oct. 2, 2024, 6:16 p.m. UTC | #1

On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> Instead of looping over every byte of the tail, unroll loop manually
> using switch statement, then compilers (at least GCC and Clang) will
> generate a jump table [1], which is faster on a microbenchmark [2].
>
> [1]: https://godbolt.org/z/aE8Mq3j5G
> [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
>
> libstdc++-v3/ChangeLog:
>
>         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
>           loop using switch statement.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> ---
>  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> index 3665375096a..294a7323dd0 100644
> --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> @@ -50,10 +50,29 @@ namespace
>    load_bytes(const char* p, int n)
>    {
>      std::size_t result = 0;
> -    --n;
> -    do
> -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> -    while (--n >= 0);

Don't we still need to loop, for the case where n >= 8? Otherwise we
only hash the first 8 bytes.

> +    switch(n & 7)
> +      {
> +      case 7:
> +       result |= std::size_t(p[6]) << 48;
> +       [[gnu::fallthrough]];
> +      case 6:
> +       result |= std::size_t(p[5]) << 40;
> +       [[gnu::fallthrough]];
> +      case 5:
> +       result |= std::size_t(p[4]) << 32;
> +       [[gnu::fallthrough]];
> +      case 4:
> +       result |= std::size_t(p[3]) << 24;
> +       [[gnu::fallthrough]];
> +      case 3:
> +       result |= std::size_t(p[2]) << 16;
> +       [[gnu::fallthrough]];
> +      case 2:
> +       result |= std::size_t(p[1]) << 8;
> +       [[gnu::fallthrough]];
> +      case 1:
> +       result |= std::size_t(p[0]);
> +      };
>      return result;
>    }
>
> --
> 2.43.5
>

Jonathan Wakely Oct. 2, 2024, 6:25 p.m. UTC | #2

On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> >
> > Instead of looping over every byte of the tail, unroll loop manually
> > using switch statement, then compilers (at least GCC and Clang) will
> > generate a jump table [1], which is faster on a microbenchmark [2].
> >
> > [1]: https://godbolt.org/z/aE8Mq3j5G
> > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> >
> > libstdc++-v3/ChangeLog:
> >
> >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> >           loop using switch statement.
> >
> > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > ---
> >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> >  1 file changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > index 3665375096a..294a7323dd0 100644
> > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > @@ -50,10 +50,29 @@ namespace
> >    load_bytes(const char* p, int n)
> >    {
> >      std::size_t result = 0;
> > -    --n;
> > -    do
> > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > -    while (--n >= 0);
>
> Don't we still need to loop, for the case where n >= 8? Otherwise we
> only hash the first 8 bytes.

Ah, but it's only ever called with load_bytes(end, len & 0x7)



>
> > +    switch(n & 7)
> > +      {
> > +      case 7:
> > +       result |= std::size_t(p[6]) << 48;
> > +       [[gnu::fallthrough]];
> > +      case 6:
> > +       result |= std::size_t(p[5]) << 40;
> > +       [[gnu::fallthrough]];
> > +      case 5:
> > +       result |= std::size_t(p[4]) << 32;
> > +       [[gnu::fallthrough]];
> > +      case 4:
> > +       result |= std::size_t(p[3]) << 24;
> > +       [[gnu::fallthrough]];
> > +      case 3:
> > +       result |= std::size_t(p[2]) << 16;
> > +       [[gnu::fallthrough]];
> > +      case 2:
> > +       result |= std::size_t(p[1]) << 8;
> > +       [[gnu::fallthrough]];
> > +      case 1:
> > +       result |= std::size_t(p[0]);
> > +      };
> >      return result;
> >    }
> >
> > --
> > 2.43.5
> >

Jonathan Wakely Oct. 2, 2024, 7:15 p.m. UTC | #3

On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > >
> > > Instead of looping over every byte of the tail, unroll loop manually
> > > using switch statement, then compilers (at least GCC and Clang) will
> > > generate a jump table [1], which is faster on a microbenchmark [2].
> > >
> > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > >
> > > libstdc++-v3/ChangeLog:
> > >
> > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > >           loop using switch statement.
> > >
> > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > ---
> > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > index 3665375096a..294a7323dd0 100644
> > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > @@ -50,10 +50,29 @@ namespace
> > >    load_bytes(const char* p, int n)
> > >    {
> > >      std::size_t result = 0;
> > > -    --n;
> > > -    do
> > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > -    while (--n >= 0);
> >
> > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > only hash the first 8 bytes.
>
> Ah, but it's only ever called with load_bytes(end, len & 0x7)

It seems to be slower for short strings, but a win overall:
https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8
This measures different lengths, and tries to ensure that the string
contents aren't treated as constant.

>
>
> >
> > > +    switch(n & 7)
> > > +      {
> > > +      case 7:
> > > +       result |= std::size_t(p[6]) << 48;
> > > +       [[gnu::fallthrough]];
> > > +      case 6:
> > > +       result |= std::size_t(p[5]) << 40;
> > > +       [[gnu::fallthrough]];
> > > +      case 5:
> > > +       result |= std::size_t(p[4]) << 32;
> > > +       [[gnu::fallthrough]];
> > > +      case 4:
> > > +       result |= std::size_t(p[3]) << 24;
> > > +       [[gnu::fallthrough]];
> > > +      case 3:
> > > +       result |= std::size_t(p[2]) << 16;
> > > +       [[gnu::fallthrough]];
> > > +      case 2:
> > > +       result |= std::size_t(p[1]) << 8;
> > > +       [[gnu::fallthrough]];
> > > +      case 1:
> > > +       result |= std::size_t(p[0]);
> > > +      };
> > >      return result;
> > >    }
> > >
> > > --
> > > 2.43.5
> > >

Dmitry Ilvokhin Oct. 3, 2024, 12:25 p.m. UTC | #4

On Wed, Oct 02, 2024 at 08:15:38PM +0100, Jonathan Wakely wrote:
> On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> > >
> > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > > >
> > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > >
> > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > >
> > > > libstdc++-v3/ChangeLog:
> > > >
> > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > >           loop using switch statement.
> > > >
> > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > > ---
> > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > index 3665375096a..294a7323dd0 100644
> > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > @@ -50,10 +50,29 @@ namespace
> > > >    load_bytes(const char* p, int n)
> > > >    {
> > > >      std::size_t result = 0;
> > > > -    --n;
> > > > -    do
> > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > -    while (--n >= 0);
> > >
> > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > only hash the first 8 bytes.
> >
> > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> 
> It seems to be slower for short strings, but a win overall:
> https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8
> This measures different lengths, and tries to ensure that the string
> contents aren't treated as constant.

Nice find, thanks for spotting it.

In retrospect, I think case n = 7 is best for unrolled switch version
and worst for loop version, because it maximize overhead from loop
control flow instructions. In contrast, case n = 1 is completely
opposite in that sense, we have minimal overhead from loop control flow.

If explanation above is correct, then not sure we can do something
better for small cases. Seems like this lunch is not completely free.

> 
> >
> >
> > >
> > > > +    switch(n & 7)
> > > > +      {
> > > > +      case 7:
> > > > +       result |= std::size_t(p[6]) << 48;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 6:
> > > > +       result |= std::size_t(p[5]) << 40;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 5:
> > > > +       result |= std::size_t(p[4]) << 32;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 4:
> > > > +       result |= std::size_t(p[3]) << 24;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 3:
> > > > +       result |= std::size_t(p[2]) << 16;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 2:
> > > > +       result |= std::size_t(p[1]) << 8;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 1:
> > > > +       result |= std::size_t(p[0]);
> > > > +      };
> > > >      return result;
> > > >    }
> > > >
> > > > --
> > > > 2.43.5
> > > >
>

Forgot to CC mailing lists. Sorry about that.

Richard Biener Oct. 4, 2024, 6:53 a.m. UTC | #5

On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > >
> > > Instead of looping over every byte of the tail, unroll loop manually
> > > using switch statement, then compilers (at least GCC and Clang) will
> > > generate a jump table [1], which is faster on a microbenchmark [2].
> > >
> > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > >
> > > libstdc++-v3/ChangeLog:
> > >
> > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > >           loop using switch statement.
> > >
> > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > ---
> > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > index 3665375096a..294a7323dd0 100644
> > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > @@ -50,10 +50,29 @@ namespace
> > >    load_bytes(const char* p, int n)
> > >    {
> > >      std::size_t result = 0;
> > > -    --n;
> > > -    do
> > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > -    while (--n >= 0);
> >
> > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > only hash the first 8 bytes.
>
> Ah, but it's only ever called with load_bytes(end, len & 0x7)

The compiler should do such transforms - you probably want to tell
it that n < 8 though, it likely doesn't (always) know.

>
>
> >
> > > +    switch(n & 7)
> > > +      {
> > > +      case 7:
> > > +       result |= std::size_t(p[6]) << 48;
> > > +       [[gnu::fallthrough]];
> > > +      case 6:
> > > +       result |= std::size_t(p[5]) << 40;
> > > +       [[gnu::fallthrough]];
> > > +      case 5:
> > > +       result |= std::size_t(p[4]) << 32;
> > > +       [[gnu::fallthrough]];
> > > +      case 4:
> > > +       result |= std::size_t(p[3]) << 24;
> > > +       [[gnu::fallthrough]];
> > > +      case 3:
> > > +       result |= std::size_t(p[2]) << 16;
> > > +       [[gnu::fallthrough]];
> > > +      case 2:
> > > +       result |= std::size_t(p[1]) << 8;
> > > +       [[gnu::fallthrough]];
> > > +      case 1:
> > > +       result |= std::size_t(p[0]);
> > > +      };
> > >      return result;
> > >    }
> > >
> > > --
> > > 2.43.5
> > >
>

Jonathan Wakely Oct. 4, 2024, 9:19 a.m. UTC | #6

On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> > >
> > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > > >
> > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > >
> > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > >
> > > > libstdc++-v3/ChangeLog:
> > > >
> > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > >           loop using switch statement.
> > > >
> > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > > ---
> > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > index 3665375096a..294a7323dd0 100644
> > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > @@ -50,10 +50,29 @@ namespace
> > > >    load_bytes(const char* p, int n)
> > > >    {
> > > >      std::size_t result = 0;
> > > > -    --n;
> > > > -    do
> > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > -    while (--n >= 0);
> > >
> > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > only hash the first 8 bytes.
> >
> > Ah, but it's only ever called with load_bytes(end, len & 0x7)
>
> The compiler should do such transforms - you probably want to tell
> it that n < 8 though, it likely doesn't (always) know.

e.g. like this?

    if ((n & 7) != n)
      __builtin_unreachable();

For the microbenchmark that seems to make things consistently worse:
https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw

Jonathan Wakely Oct. 4, 2024, 9:20 a.m. UTC | #7

On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> > >
> > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> > > >
> > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > > > >
> > > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > > >
> > > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > > >
> > > > > libstdc++-v3/ChangeLog:
> > > > >
> > > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > > >           loop using switch statement.
> > > > >
> > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > > > ---
> > > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > index 3665375096a..294a7323dd0 100644
> > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > @@ -50,10 +50,29 @@ namespace
> > > > >    load_bytes(const char* p, int n)
> > > > >    {
> > > > >      std::size_t result = 0;
> > > > > -    --n;
> > > > > -    do
> > > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > > -    while (--n >= 0);
> > > >
> > > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > > only hash the first 8 bytes.
> > >
> > > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> >
> > The compiler should do such transforms - you probably want to tell
> > it that n < 8 though, it likely doesn't (always) know.
>
> e.g. like this?
>
>     if ((n & 7) != n)
>       __builtin_unreachable();
>
> For the microbenchmark that seems to make things consistently worse:
> https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw

Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <=
n is always true too.

Richard Biener Oct. 4, 2024, 11:40 a.m. UTC | #8

On Fri, Oct 4, 2024 at 11:20 AM Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> > > >
> > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> > > > >
> > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > > > > >
> > > > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > > > >
> > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > > > >
> > > > > > libstdc++-v3/ChangeLog:
> > > > > >
> > > > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > > > >           loop using switch statement.
> > > > > >
> > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > > > > ---
> > > > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > index 3665375096a..294a7323dd0 100644
> > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > @@ -50,10 +50,29 @@ namespace
> > > > > >    load_bytes(const char* p, int n)
> > > > > >    {
> > > > > >      std::size_t result = 0;
> > > > > > -    --n;
> > > > > > -    do
> > > > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > > > -    while (--n >= 0);
> > > > >
> > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > > > only hash the first 8 bytes.
> > > >
> > > > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> > >
> > > The compiler should do such transforms - you probably want to tell
> > > it that n < 8 though, it likely doesn't (always) know.
> >
> > e.g. like this?
> >
> >     if ((n & 7) != n)
> >       __builtin_unreachable();
> >
> > For the microbenchmark that seems to make things consistently worse:
> > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw
>
> Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <=
> n is always true too.

Yes, expose the tightest range to the compiler.  Of course this might trigger
opportunistic optimiztation like complete peeling.

Richard.

Dmitry Ilvokhin Oct. 4, 2024, 12:53 p.m. UTC | #9

On Fri, Oct 04, 2024 at 10:20:27AM +0100, Jonathan Wakely wrote:
> On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> > > >
> > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> > > > >
> > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > > > > >
> > > > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > > > >
> > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > > > >
> > > > > > libstdc++-v3/ChangeLog:
> > > > > >
> > > > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > > > >           loop using switch statement.
> > > > > >
> > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > > > > ---
> > > > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > index 3665375096a..294a7323dd0 100644
> > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > @@ -50,10 +50,29 @@ namespace
> > > > > >    load_bytes(const char* p, int n)
> > > > > >    {
> > > > > >      std::size_t result = 0;
> > > > > > -    --n;
> > > > > > -    do
> > > > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > > > -    while (--n >= 0);
> > > > >
> > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > > > only hash the first 8 bytes.
> > > >
> > > > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> > >
> > > The compiler should do such transforms - you probably want to tell
> > > it that n < 8 though, it likely doesn't (always) know.
> >
> > e.g. like this?
> >
> >     if ((n & 7) != n)
> >       __builtin_unreachable();
> >
> > For the microbenchmark that seems to make things consistently worse:
> > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw
> 
> Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <=
> n is always true too.
>

GCC still wasn't able to unroll the loop, even with a
__builtin_unreachable, but benchmark link you mentioned above uses -O2
optimization level (not sure if it was intentional).

If we'll use -O3 [1], then GCC was able to unroll the loop for
load_bytes_loop_assume version, but at the same time I am not sure all
loop control instructions were elided, I still can see them on Godbolt
version of generated code [2]. Benchmark charts partially confirm that,
because performance of load_bytes_loop and load_bytes_loop_assume are
now quite close (same actually, except case n = 1). I guess it would
make sense, as we execute same amount of instructions.

In addition, chart for load_bytes_switch look quite jumpy for [1] and
became better for cases n = 1 and n = 2. At this point I am not sure it
is not a code alignment issue and we are not measuring noise.

[1]: https://quick-bench.com/q/LlcgMVhL61CasZVjCWbHd3uid8w
[2]: https://godbolt.org/z/qPf1n7xWs

Jonathan Wakely Oct. 4, 2024, 1:07 p.m. UTC | #10

On Fri, 4 Oct 2024 at 13:53, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> On Fri, Oct 04, 2024 at 10:20:27AM +0100, Jonathan Wakely wrote:
> > On Fri, 4 Oct 2024 at 10:19, Jonathan Wakely <jwakely@redhat.com> wrote:
> > >
> > > On Fri, 4 Oct 2024 at 07:53, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 2, 2024 at 8:26 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> > > > >
> > > > > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwakely@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > > > > > >
> > > > > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > > > > >
> > > > > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > > > > >
> > > > > > > libstdc++-v3/ChangeLog:
> > > > > > >
> > > > > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > > > > >           loop using switch statement.
> > > > > > >
> > > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > > > > > > ---
> > > > > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > > > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > > index 3665375096a..294a7323dd0 100644
> > > > > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > > > > @@ -50,10 +50,29 @@ namespace
> > > > > > >    load_bytes(const char* p, int n)
> > > > > > >    {
> > > > > > >      std::size_t result = 0;
> > > > > > > -    --n;
> > > > > > > -    do
> > > > > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > > > > -    while (--n >= 0);
> > > > > >
> > > > > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > > > > only hash the first 8 bytes.
> > > > >
> > > > > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> > > >
> > > > The compiler should do such transforms - you probably want to tell
> > > > it that n < 8 though, it likely doesn't (always) know.
> > >
> > > e.g. like this?
> > >
> > >     if ((n & 7) != n)
> > >       __builtin_unreachable();
> > >
> > > For the microbenchmark that seems to make things consistently worse:
> > > https://quick-bench.com/q/2yCEqzFS8R8ueJ0-Gs-sZ6uWWEw
> >
> > Oh actually in the benchmark I used (!(1 <= n && n < 8)) because 1 <=
> > n is always true too.
> >
>
> GCC still wasn't able to unroll the loop, even with a
> __builtin_unreachable, but benchmark link you mentioned above uses -O2
> optimization level (not sure if it was intentional).

That was intentional, because that's how libsupc++/hash_bytes.cc gets compiled.

>
> If we'll use -O3 [1], then GCC was able to unroll the loop for
> load_bytes_loop_assume version, but at the same time I am not sure all
> loop control instructions were elided, I still can see them on Godbolt
> version of generated code [2]. Benchmark charts partially confirm that,
> because performance of load_bytes_loop and load_bytes_loop_assume are
> now quite close (same actually, except case n = 1). I guess it would
> make sense, as we execute same amount of instructions.
>
> In addition, chart for load_bytes_switch look quite jumpy for [1] and
> became better for cases n = 1 and n = 2. At this point I am not sure it
> is not a code alignment issue and we are not measuring noise.
>
> [1]: https://quick-bench.com/q/LlcgMVhL61CasZVjCWbHd3uid8w
> [2]: https://godbolt.org/z/qPf1n7xWs
>

libstdc++: Unroll loop in load_bytes function

Commit Message

Comments

Patch