Message ID | 20230731183549.2396362-1-skpgkp2@gmail.com |
---|---|
State | New |
Series | [v2] x86_64: Optimize ffsll function code size. |
On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 bytes; this patch optimizes the size to 16 bytes.
> Currently the ffsll function randomly regresses by ~20%, depending on how
> the code gets aligned.
>
> This patch fixes the random ffsll performance regression.
>
> Changes from v1:
> - Further reduce the ffsll function size to 12 bytes.
> ---
>  sysdeps/x86_64/ffsll.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..6a5803c7c1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -26,13 +26,13 @@ int
>  ffsll (long long int x)
>  {
>    long long int cnt;
> -  long long int tmp;
>
> -  asm ("bsfq %2,%0\n"         /* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"       /* If number was zero, use -1 as result.  */
> -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> +  asm ("mov $-1,%k0\n"        /* Initialize CNT to -1.  */
> +       "bsf %1,%0\n"          /* Count low bits in X and store in CNT.  */
> +       "inc %k0\n"            /* Increment CNT by 1.  */
> +       : "=&r" (cnt) : "r" (x));
>
> -  return cnt + 1;
> +  return cnt;
>  }
>
>  #ifndef __ILP32__

I still prefer if we can just remove this arch-optimized function in favor
of compiler builtins.
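For readers following the asm change above: the new three-instruction sequence
relies on BSF leaving its destination unmodified when the source is zero (AMD
documents this behavior; Intel's manual calls the destination undefined, though
current CPUs behave the same way), so pre-loading CNT with -1 and incrementing
afterwards folds the zero case into the normal path. Below is a minimal C model
of that contract — an illustrative sketch only, not glibc code, and the function
name is made up:

/* Sketch: model of what "mov $-1; bsf; inc" computes.
   cnt starts at -1; the loop stands in for BSF (which, on the CPUs the patch
   targets, leaves cnt untouched when x == 0); the final +1 maps -1 -> 0 and
   bit-index -> index + 1, matching ffsll semantics.  */
static int
ffsll_model (long long int x)
{
  int cnt = -1;                                 /* mov $-1,%k0 */
  for (int i = 0; i < 64; i++)                  /* bsf %1,%0   */
    if (((unsigned long long int) x >> i) & 1)
      {
        cnt = i;
        break;
      }
  return cnt + 1;                               /* inc %k0     */
}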
On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote:
>
> On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 bytes; this patch optimizes the size to 16 bytes.
> > Currently the ffsll function randomly regresses by ~20%, depending on how
> > the code gets aligned.
> > [...]
>
> I still prefer if we can just remove this arch-optimized function in favor
> of compiler builtins.

Sure, the compiler builtin should replace it in the long run.
In the meantime, can it get fixed?
On 31/07/23 17:58, Sunil Pandey wrote:
> On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote:
> > [...]
> >
> > I still prefer if we can just remove this arch-optimized function in favor
> > of compiler builtins.
>
> Sure, the compiler builtin should replace it in the long run.
> In the meantime, can it get fixed?

This fix only works if the compiler does not insert anything in the prologue;
if you use CET or stack-protector-strong it might not work.  And I *really*
do not want to add another assembly optimization for a symbol that won't be
used in most real programs.

And I already have a fix to use compiler builtins [1].

[1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
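For context, the builtin-based approach in [1] essentially lets the generic C
implementation take over once the compiler provides __builtin_ffsll. A rough
sketch of that shape (not the actual patch text; the function name here is made
up for illustration):

/* Sketch: with the builtin available, the arch-specific asm can be dropped
   and the definition collapses to a wrapper the compiler expands inline.  */
int
my_ffsll (long long int x)
{
  return __builtin_ffsll (x);
}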
On Mon, Jul 31, 2023 at 3:57 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote:
>
> On 31/07/23 17:58, Sunil Pandey wrote:
> > [...]
> >
> > Sure, the compiler builtin should replace it in the long run.
> > In the meantime, can it get fixed?
>
> This fix only works if the compiler does not insert anything in the prologue;
> if you use CET or stack-protector-strong it might not work.  And I *really*
> do not want to add another assembly optimization for a symbol that won't be
> used in most real programs.

v2 will fix it, as the CET overhead is 4 bytes and the latest code size is
only 12 bytes, so the total code size will be 16 bytes even if CET gets
enabled.

> And I already have a fix to use compiler builtins [1].

Here is the code generated from the builtin patch:

(gdb) b ffsll
Breakpoint 1 at 0x10a0
(gdb) run --direct
Starting program: benchtests/bench-ffsll --direct
Breakpoint 1, __ffsll (i=66900473708975203) at ffsll.c:30
30        return __builtin_ffsll (i);
(gdb) disass
Dump of assembler code for function __ffsll:
=> 0x00007ffff7c955a0 <+0>:     bsf    %rdi,%rdi
   0x00007ffff7c955a4 <+4>:     mov    $0xffffffffffffffff,%rax
   0x00007ffff7c955ab <+11>:    cmove  %rax,%rdi
   0x00007ffff7c955af <+15>:    lea    0x1(%rdi),%eax
   0x00007ffff7c955b2 <+18>:    ret
End of assembler dump.
(gdb)

It is not going to fix the problem.  The random 20% variation will continue
in benchmarking even with the builtin patch.

I do not see anything wrong with using architecture features if it helps
people save their valuable debugging time.  After all, it is a valuable
glibc feature to override the generic implementation with
architecture-specific code.

> [1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
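To make the "random ~20% regression" claim reproducible outside glibc's
benchtests, a standalone timing loop along these lines can be used. This is
only a sketch, not glibc's bench-ffsll; the iteration count and input value
are arbitrary, and it assumes the program is built with -fno-builtin-ffsll so
the call actually reaches the libc implementation being measured:

#define _GNU_SOURCE              /* for the ffsll declaration in <string.h> */
#include <stdio.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  long long int v = 66900473708975203LL;  /* input value from the gdb session above */
  volatile long long int sink = 0;
  const long iters = 100000000L;
  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < iters; i++)
    sink += ffsll (v + i);                /* with -fno-builtin-ffsll this is a libc call */
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double) (t1.tv_nsec - t0.tv_nsec);
  printf ("%.2f ns/call (sink=%lld)\n", ns / iters, sink);
  return 0;
}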
On Mon, Jul 31, 2023 at 6:44 PM Sunil Pandey via Libc-alpha <libc-alpha@sourceware.org> wrote:
> [...]
>
> It is not going to fix the problem.  The random 20% variation will continue
> in benchmarking even with the builtin patch.
>
> I do not see anything wrong with using architecture features if it helps
> people save their valuable debugging time.  After all, it is a valuable
> glibc feature to override the generic implementation with
> architecture-specific code.

For my money this patch is fine.  It's not really that important to optimize
(or worth the time that's been spent in argument), and I don't really see the
harm in the asm implementation, esp since GCC doesn't really produce good
codegen for the builtin :(
On Jul 31 2023, Noah Goldstein via Libc-alpha wrote:
> esp since GCC doesn't really produce good codegen for the builtin :(
So this is a GCC bug. Please report it.
On 31/07/23 20:44, Sunil Pandey wrote:
>
> It is not going to fix the problem.  The random 20% variation will continue
> in benchmarking even with the builtin patch.
>
> I do not see anything wrong with using architecture features if it helps
> people save their valuable debugging time.  After all, it is a valuable
> glibc feature to override the generic implementation with
> architecture-specific code.

Because it is a synthetic benchmark that is exercising code that should not be
stressed with the default compiler flags.  And the problem with arch-specific
code is that when you already have support from the compiler to generate
optimized code, it is just extra complexity plus maintenance.

On my system, ffsll is only used by a single program
(/usr/bin/nvidia-persistenced), and I am not sure how that is compiled because
it is a closed-source one.  The ffs function is used by a couple more (gdb,
lvm, and some mesa drivers), but again it is far from being a hotspot.

And as Andreas has said, the best course of action here is to fix the compiler
if it is not generating good enough code.  Fixing GCC means that we get the
optimization for free (by using the builtin), as does any code that actually
uses ffs/ffsl/ffsll.
On Tue, Aug 1, 2023 at 9:47 AM Adhemerval Zanella Netto via Libc-alpha <libc-alpha@sourceware.org> wrote:
>
> On 31/07/23 20:44, Sunil Pandey wrote:
>
> On my system, ffsll is only used by a single program
> (/usr/bin/nvidia-persistenced), and I am not sure how that is compiled
> because it is a closed-source one.

It is probably built with some strict-standard setting, -ansi or whatever.

> The ffs function is used by a couple more (gdb, lvm, and some mesa drivers),
> but again it is far from being a hotspot.

https://cgit.freedesktop.org/drm/libdrm/tree/meson.build#n27 --> built with
std=c11 instead of gnu11, therefore not benefiting from the compiler's use of
builtins.

> And as Andreas has said, the best course of action here is to fix the
> compiler if it is not generating good enough code.  Fixing GCC means that
> we get the optimization for free (by using the builtin), as does any code
> that actually uses ffs/ffsl/ffsll.

I agree.  The compiler already does this; if it is done suboptimally, the ball
is in GCC's court to fix it.
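To illustrate the std=c11 vs gnu11 point above: GCC treats ffsll as a builtin
only outside strict ISO mode, so the same translation unit either keeps a libc
call or gets the inline expansion depending on the -std flag. A small sketch;
the compile commands and resulting codegen in the comment are assumptions about
typical GCC behavior on x86-64, not output taken from this thread:

/* Sketch:
     gcc -O2 -std=gnu11 -c lowest_bit.c  -> ffsll recognized as a builtin and
                                            expanded inline (bsf/cmove/lea).
     gcc -O2 -std=c11   -c lowest_bit.c  -> ffsll stays an external call into
                                            libc, so the libc implementation
                                            (and its alignment) matters.  */
#define _GNU_SOURCE
#include <string.h>

int
lowest_bit (long long int x)
{
  return ffsll (x);
}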
On 7/31/23 14:35, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 bytes; this patch optimizes the size to 16 bytes.
> Currently the ffsll function randomly regresses by ~20%, depending on how
> the code gets aligned.
>
> This patch fixes the random ffsll performance regression.

Here is my suggestion.

Commit this to glibc 2.39 as an incremental improvement that we can backport
to any active branch.

I have already approved the removal for glibc 2.40 via Adhemerval's patch.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> Changes from v1:
> - Further reduce the ffsll function size to 12 bytes.
> [...]
On Wed, Jan 10, 2024 at 11:19 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 7/31/23 14:35, Sunil K Pandey via Libc-alpha wrote:
> > [...]
>
> Here is my suggestion.
>
> Commit this to glibc 2.39 as an incremental improvement that we can
> backport to any active branch.
>
> I have already approved the removal for glibc 2.40 via Adhemerval's patch.
>
> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
>
> --
> Cheers,
> Carlos.

I would like to backport this patch to release branches from 2.28 to 2.38.
Any comments or objections?

--Sunil
diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
index a1c13d4906..6a5803c7c1 100644
--- a/sysdeps/x86_64/ffsll.c
+++ b/sysdeps/x86_64/ffsll.c
@@ -26,13 +26,13 @@ int
 ffsll (long long int x)
 {
   long long int cnt;
-  long long int tmp;
 
-  asm ("bsfq %2,%0\n"         /* Count low bits in X and store in %1.  */
-       "cmoveq %1,%0\n"       /* If number was zero, use -1 as result.  */
-       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
+  asm ("mov $-1,%k0\n"        /* Initialize CNT to -1.  */
+       "bsf %1,%0\n"          /* Count low bits in X and store in CNT.  */
+       "inc %k0\n"            /* Increment CNT by 1.  */
+       : "=&r" (cnt) : "r" (x));
 
-  return cnt + 1;
+  return cnt;
 }
 
 #ifndef __ILP32__
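A quick way to sanity-check the patched routine is to compare libc's ffsll
against __builtin_ffsll over the interesting inputs (zero, the lowest and
highest bits, all-ones). This is only a sketch; it assumes the test is linked
against the patched libc and built with -fno-builtin-ffsll so the left-hand
call is not itself folded to the builtin:

#define _GNU_SOURCE              /* for the ffsll declaration in <string.h> */
#include <assert.h>
#include <limits.h>
#include <string.h>

int
main (void)
{
  long long int cases[] = { 0, 1, 2, 0x10LL, LLONG_MIN, -1LL,
                            66900473708975203LL };
  for (unsigned int i = 0; i < sizeof cases / sizeof cases[0]; i++)
    /* x == 0 is the case the mov/bsf/inc sequence handles implicitly,
       since BSF leaves the pre-loaded -1 in place.  */
    assert (ffsll (cases[i]) == __builtin_ffsll (cases[i]));
  return 0;
}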