libstdc++: Optimize std::gcd

Message ID	9e5636e8-224f-4dde-9c60-489b4aad6534@gmail.com
State	New
Headers	show Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 2588939B2ECE Message-ID: <9e5636e8-224f-4dde-9c60-489b4aad6534@gmail.com> Date: Thu, 6 Jun 2024 22:29:23 -0700 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Content-Language: en-US From: Stephen Face <shpface@gmail.com> Subject: [PATCH] libstdc++: Optimize std::gcd To: libstdc++@gcc.gnu.org Cc: gcc-patches@gcc.gnu.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: list Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org
Series	libstdc++: Optimize std::gcd \| expand libstdc++: Optimize std::gcd

Message ID

9e5636e8-224f-4dde-9c60-489b4aad6534@gmail.com

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 2588939B2ECE
Message-ID: <9e5636e8-224f-4dde-9c60-489b4aad6534@gmail.com>
Date: Thu, 6 Jun 2024 22:29:23 -0700
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
From: Stephen Face <shpface@gmail.com>
Subject: [PATCH] libstdc++: Optimize std::gcd
To: libstdc++@gcc.gnu.org
Cc: gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: list
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

Series

libstdc++: Optimize std::gcd | expand

Commit Message

Stephen Face June 7, 2024, 5:29 a.m. UTC

This patch is to optimize the runtime execution of gcd. Mathematically,
it computes with the same algorithm as before, but subtractions and
branches are rearranged to encourage generation of code that can use
flags from the subtractions for conditional moves. Additionally, most
pairs of integers are coprime, so this patch also includes a check for
one of the integers to be equal to 1, and then it will exit the loop
early in this case.

libstdc++-v3/ChangeLog:

	* include/std/numeric(__gcd): Optimize.
---
I have tested this on x86_64-linux and aarch64-linux. I have tested the
timing with random distributions of small inputs and large inputs on a
couple of machines with -O2 and found decreases in execution time from
20% to 60% depending on the machine and distribution of inputs.

 libstdc++-v3/include/std/numeric | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

Comments

Sam James June 7, 2024, 8:57 a.m. UTC | #1

Stephen Face <shpface@gmail.com> writes:

> This patch is to optimize the runtime execution of gcd. Mathematically,
> it computes with the same algorithm as before, but subtractions and
> branches are rearranged to encourage generation of code that can use
> flags from the subtractions for conditional moves. Additionally, most
> pairs of integers are coprime, so this patch also includes a check for
> one of the integers to be equal to 1, and then it will exit the loop
> early in this case.

Is it worth filing a bug for the missed optimisation? You shouldn't have
to write things in a specific order. Thanks.

>
> libstdc++-v3/ChangeLog:
>
> 	* include/std/numeric(__gcd): Optimize.
> ---
> I have tested this on x86_64-linux and aarch64-linux. I have tested the
> timing with random distributions of small inputs and large inputs on a
> couple of machines with -O2 and found decreases in execution time from
> 20% to 60% depending on the machine and distribution of inputs.
>
>  libstdc++-v3/include/std/numeric | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
> index c912db4a519..3c9e8387a0e 100644
> --- a/libstdc++-v3/include/std/numeric
> +++ b/libstdc++-v3/include/std/numeric
> @@ -148,19 +148,20 @@ namespace __detail
>  
>        while (true)
>  	{
> -	  if (__m > __n)
> -	    {
> -	      _Tp __tmp = __m;
> -	      __m = __n;
> -	      __n = __tmp;
> -	    }
> +	  _Tp __m_minus_n = __m - __n;
> +	  if (__m_minus_n == 0)
> +	    return __m << __k;
>  
> -	  __n -= __m;
> +	  _Tp __next_m = __m < __n ? __m : __n;
>  
> -	  if (__n == 0)
> -	    return __m << __k;
> +	  if (__next_m == 1)
> +	    return __next_m << __k;
> +
> +	  _Tp __n_minus_m = __n - __m;
> +	  __n = __n < __m ? __m_minus_n : __n_minus_m;
> +	  __m = __next_m;
>  
> -	  __n >>= std::__countr_zero(__n);
> +	  __n >>= std::__countr_zero(__m_minus_n);
>  	}
>      }
>  } // namespace __detail

Jonathan Wakely June 7, 2024, 9:30 a.m. UTC | #2

On Fri, 7 Jun 2024 at 09:57, Sam James <sam@gentoo.org> wrote:
>
> Stephen Face <shpface@gmail.com> writes:
>
> > This patch is to optimize the runtime execution of gcd. Mathematically,
> > it computes with the same algorithm as before, but subtractions and
> > branches are rearranged to encourage generation of code that can use
> > flags from the subtractions for conditional moves. Additionally, most
> > pairs of integers are coprime, so this patch also includes a check for
> > one of the integers to be equal to 1, and then it will exit the loop
> > early in this case.
>
> Is it worth filing a bug for the missed optimisation? You shouldn't have
> to write things in a specific order. Thanks.

Yes, I think a bug report would be good. But 20%-60% decreases in run
time seems significant enough that we should take the libstdc++ patch
now, rather than wait for a possible compiler fix to come later.

Stephen, could you please confirm whether you have a copyright
assignment in place for GCC, and if not whether you would be will to
complete one, or alternatively contribute this under the DCO terms:
https://gcc.gnu.org/dco.html
Thanks!


>
> >
> > libstdc++-v3/ChangeLog:
> >
> >       * include/std/numeric(__gcd): Optimize.
> > ---
> > I have tested this on x86_64-linux and aarch64-linux. I have tested the
> > timing with random distributions of small inputs and large inputs on a
> > couple of machines with -O2 and found decreases in execution time from
> > 20% to 60% depending on the machine and distribution of inputs.
> >
> >  libstdc++-v3/include/std/numeric | 21 +++++++++++----------
> >  1 file changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
> > index c912db4a519..3c9e8387a0e 100644
> > --- a/libstdc++-v3/include/std/numeric
> > +++ b/libstdc++-v3/include/std/numeric
> > @@ -148,19 +148,20 @@ namespace __detail
> >
> >        while (true)
> >       {
> > -       if (__m > __n)
> > -         {
> > -           _Tp __tmp = __m;
> > -           __m = __n;
> > -           __n = __tmp;
> > -         }
> > +       _Tp __m_minus_n = __m - __n;
> > +       if (__m_minus_n == 0)
> > +         return __m << __k;
> >
> > -       __n -= __m;
> > +       _Tp __next_m = __m < __n ? __m : __n;
> >
> > -       if (__n == 0)
> > -         return __m << __k;
> > +       if (__next_m == 1)
> > +         return __next_m << __k;
> > +
> > +       _Tp __n_minus_m = __n - __m;
> > +       __n = __n < __m ? __m_minus_n : __n_minus_m;
> > +       __m = __next_m;
> >
> > -       __n >>= std::__countr_zero(__n);
> > +       __n >>= std::__countr_zero(__m_minus_n);
> >       }
> >      }
> >  } // namespace __detail
>

Stephen Face June 7, 2024, 6:42 p.m. UTC | #3

On 6/7/24 2:30 AM, Jonathan Wakely wrote:
> On Fri, 7 Jun 2024 at 09:57, Sam James <sam@gentoo.org> wrote:
>>
>> Stephen Face <shpface@gmail.com> writes:
>>
>>> This patch is to optimize the runtime execution of gcd. Mathematically,
>>> it computes with the same algorithm as before, but subtractions and
>>> branches are rearranged to encourage generation of code that can use
>>> flags from the subtractions for conditional moves. Additionally, most
>>> pairs of integers are coprime, so this patch also includes a check for
>>> one of the integers to be equal to 1, and then it will exit the loop
>>> early in this case.
>>
>> Is it worth filing a bug for the missed optimisation? You shouldn't have
>> to write things in a specific order. Thanks.

This patch is not a pure reordering. It has the added comparison with 1.
Also, the argument to the trailing zero count is now the result of the
subtraction before the comparison of m and n. This shortens the chain of
dependencies, which can help the loop run faster.

> 
> Yes, I think a bug report would be good. But 20%-60% decreases in run
> time seems significant enough that we should take the libstdc++ patch
> now, rather than wait for a possible compiler fix to come later.
> 
> Stephen, could you please confirm whether you have a copyright
> assignment in place for GCC, and if not whether you would be will to
> complete one, or alternatively contribute this under the DCO terms:
> https://gcc.gnu.org/dco.html
> Thanks!

I do not have a copyright assignment in place for GCC. I can certify to
the DCO terms for this contribution.

    Signed-off-by: Stephen Face <shpface@gmail.com>

> 
> 
>>
>>>
>>> libstdc++-v3/ChangeLog:
>>>
>>>       * include/std/numeric(__gcd): Optimize.
>>> ---
>>> I have tested this on x86_64-linux and aarch64-linux. I have tested the
>>> timing with random distributions of small inputs and large inputs on a
>>> couple of machines with -O2 and found decreases in execution time from
>>> 20% to 60% depending on the machine and distribution of inputs.
>>>
>>>  libstdc++-v3/include/std/numeric | 21 +++++++++++----------
>>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
>>> index c912db4a519..3c9e8387a0e 100644
>>> --- a/libstdc++-v3/include/std/numeric
>>> +++ b/libstdc++-v3/include/std/numeric
>>> @@ -148,19 +148,20 @@ namespace __detail
>>>
>>>        while (true)
>>>       {
>>> -       if (__m > __n)
>>> -         {
>>> -           _Tp __tmp = __m;
>>> -           __m = __n;
>>> -           __n = __tmp;
>>> -         }
>>> +       _Tp __m_minus_n = __m - __n;
>>> +       if (__m_minus_n == 0)
>>> +         return __m << __k;
>>>
>>> -       __n -= __m;
>>> +       _Tp __next_m = __m < __n ? __m : __n;
>>>
>>> -       if (__n == 0)
>>> -         return __m << __k;
>>> +       if (__next_m == 1)
>>> +         return __next_m << __k;
>>> +
>>> +       _Tp __n_minus_m = __n - __m;
>>> +       __n = __n < __m ? __m_minus_n : __n_minus_m;
>>> +       __m = __next_m;
>>>
>>> -       __n >>= std::__countr_zero(__n);
>>> +       __n >>= std::__countr_zero(__m_minus_n);
>>>       }
>>>      }
>>>  } // namespace __detail
>>
>

Jonathan Wakely June 7, 2024, 6:48 p.m. UTC | #4

On Fri, 7 Jun 2024 at 19:42, Stephen Face <shpface@gmail.com> wrote:
>
> On 6/7/24 2:30 AM, Jonathan Wakely wrote:
> > On Fri, 7 Jun 2024 at 09:57, Sam James <sam@gentoo.org> wrote:
> >>
> >> Stephen Face <shpface@gmail.com> writes:
> >>
> >>> This patch is to optimize the runtime execution of gcd. Mathematically,
> >>> it computes with the same algorithm as before, but subtractions and
> >>> branches are rearranged to encourage generation of code that can use
> >>> flags from the subtractions for conditional moves. Additionally, most
> >>> pairs of integers are coprime, so this patch also includes a check for
> >>> one of the integers to be equal to 1, and then it will exit the loop
> >>> early in this case.
> >>
> >> Is it worth filing a bug for the missed optimisation? You shouldn't have
> >> to write things in a specific order. Thanks.
>
> This patch is not a pure reordering. It has the added comparison with 1.
> Also, the argument to the trailing zero count is now the result of the
> subtraction before the comparison of m and n. This shortens the chain of
> dependencies, which can help the loop run faster.
>
> >
> > Yes, I think a bug report would be good. But 20%-60% decreases in run
> > time seems significant enough that we should take the libstdc++ patch
> > now, rather than wait for a possible compiler fix to come later.
> >
> > Stephen, could you please confirm whether you have a copyright
> > assignment in place for GCC, and if not whether you would be will to
> > complete one, or alternatively contribute this under the DCO terms:
> > https://gcc.gnu.org/dco.html
> > Thanks!
>
> I do not have a copyright assignment in place for GCC. I can certify to
> the DCO terms for this contribution.
>
>     Signed-off-by: Stephen Face <shpface@gmail.com>

Thank you! I'll take care of this patch next week.


>
> >
> >
> >>
> >>>
> >>> libstdc++-v3/ChangeLog:
> >>>
> >>>       * include/std/numeric(__gcd): Optimize.
> >>> ---
> >>> I have tested this on x86_64-linux and aarch64-linux. I have tested the
> >>> timing with random distributions of small inputs and large inputs on a
> >>> couple of machines with -O2 and found decreases in execution time from
> >>> 20% to 60% depending on the machine and distribution of inputs.
> >>>
> >>>  libstdc++-v3/include/std/numeric | 21 +++++++++++----------
> >>>  1 file changed, 11 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
> >>> index c912db4a519..3c9e8387a0e 100644
> >>> --- a/libstdc++-v3/include/std/numeric
> >>> +++ b/libstdc++-v3/include/std/numeric
> >>> @@ -148,19 +148,20 @@ namespace __detail
> >>>
> >>>        while (true)
> >>>       {
> >>> -       if (__m > __n)
> >>> -         {
> >>> -           _Tp __tmp = __m;
> >>> -           __m = __n;
> >>> -           __n = __tmp;
> >>> -         }
> >>> +       _Tp __m_minus_n = __m - __n;
> >>> +       if (__m_minus_n == 0)
> >>> +         return __m << __k;
> >>>
> >>> -       __n -= __m;
> >>> +       _Tp __next_m = __m < __n ? __m : __n;
> >>>
> >>> -       if (__n == 0)
> >>> -         return __m << __k;
> >>> +       if (__next_m == 1)
> >>> +         return __next_m << __k;
> >>> +
> >>> +       _Tp __n_minus_m = __n - __m;
> >>> +       __n = __n < __m ? __m_minus_n : __n_minus_m;
> >>> +       __m = __next_m;
> >>>
> >>> -       __n >>= std::__countr_zero(__n);
> >>> +       __n >>= std::__countr_zero(__m_minus_n);
> >>>       }
> >>>      }
> >>>  } // namespace __detail
> >>
> >
>

diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
index c912db4a519..3c9e8387a0e 100644
--- a/libstdc++-v3/include/std/numeric
+++ b/libstdc++-v3/include/std/numeric
@@ -148,19 +148,20 @@  namespace __detail
 
       while (true)
 	{
-	  if (__m > __n)
-	    {
-	      _Tp __tmp = __m;
-	      __m = __n;
-	      __n = __tmp;
-	    }
+	  _Tp __m_minus_n = __m - __n;
+	  if (__m_minus_n == 0)
+	    return __m << __k;
 
-	  __n -= __m;
+	  _Tp __next_m = __m < __n ? __m : __n;
 
-	  if (__n == 0)
-	    return __m << __k;
+	  if (__next_m == 1)
+	    return __next_m << __k;
+
+	  _Tp __n_minus_m = __n - __m;
+	  __n = __n < __m ? __m_minus_n : __n_minus_m;
+	  __m = __next_m;
 
-	  __n >>= std::__countr_zero(__n);
+	  __n >>= std::__countr_zero(__m_minus_n);
 	}
     }
 } // namespace __detail

libstdc++: Optimize std::gcd

Commit Message

Comments

Patch