Message ID: 20230726160524.1955013-1-skpgkp2@gmail.com
State: New
Series: x86_64: Optimize ffsll function code size.
On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> The ffsll function size is 17 bytes; this patch optimizes the size to
> 16 bytes.  Currently the ffsll function randomly regresses by ~20%,
> depending on how the code gets aligned.
>
> This patch fixes the random ffsll performance regression.
> ---
>  sysdeps/x86_64/ffsll.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..dbded6f0a1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -29,7 +29,7 @@ ffsll (long long int x)
>    long long int tmp;
>
>    asm ("bsfq %2,%0\n"	/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"	/* If number was zero, use -1 as result.  */
> +       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */

This no longer produces -1, but 0xffffffff in cnt.  However, since the
return type is 'int', cnt need not be 'long long int' either.  I'm not
sure why tmp exists at all, since cnt is the only register modified.


r~
On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > [...]
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the
> return type is 'int', cnt need not be 'long long int' either.  I'm not
> sure why tmp exists at all, since cnt is the only register modified.

tmp was initialized to -1 with:

  : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));

H.J.
On Wed, Jul 26, 2023 at 11:38 AM Richard Henderson via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > [...]
> >    asm ("bsfq %2,%0\n"	/* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"	/* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */

Alternatively, just store -1 in the destination before the bsfq and drop
the cmov altogether.

> This no longer produces -1, but 0xffffffff in cnt.  [...]
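For concreteness, a minimal C sketch of that cmov-free variant (hypothetical code, not part of the posted patch; the helper name is illustrative, and it assumes BSF leaves the destination unchanged when the source is zero, the behavior debated later in this thread):

```c
#include <stdio.h>

/* Sketch only: relies on BSF preserving its destination register when
   the source is zero, which the Intel manual documents as undefined.  */
static int
ffsll_nocmov (long long int x)
{
  int cnt = -1;
  asm ("bsfq %q1,%q0"		/* cnt keeps its -1 when x == 0.  */
       : "+r" (cnt)
       : "r" (x));
  return cnt + 1;
}

int
main (void)
{
  /* Expect 0, 1, 13.  */
  printf ("%d %d %d\n", ffsll_nocmov (0LL), ffsll_nocmov (1LL),
	  ffsll_nocmov (0x1000LL));
  return 0;
}
```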
On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson
<richard.henderson@linaro.org> wrote:

> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > [...]
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the
> return type is 'int', cnt need not be 'long long int' either.  I'm not
> sure why tmp exists at all, since cnt is the only register modified.

Here is the exact assembly produced with this change:

./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <ffsll>:
   0:	ba ff ff ff ff       	mov    $0xffffffff,%edx
   5:	48 0f bc c7          	bsf    %rdi,%rax
   9:	0f 44 c2             	cmove  %edx,%eax
   c:	83 c0 01             	add    $0x1,%eax
   f:	c3                   	ret
On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> [...]
> Here is the exact assembly produced with this change:
>
> 0000000000000000 <ffsll>:
>    0:	ba ff ff ff ff       	mov    $0xffffffff,%edx
>    5:	48 0f bc c7          	bsf    %rdi,%rax
>    9:	0f 44 c2             	cmove  %edx,%eax
>    c:	83 c0 01             	add    $0x1,%eax
>    f:	c3                   	ret

FWIW it should be:
```
0000000000000000 <.text>:
   0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
   5:	48 0f bc c7          	bsf    %rdi,%rax
   9:	ff c0                	inc    %eax
```
And since it's in inline asm, there is no reason not to get that.
On Wed, Jul 26, 2023 at 11:06 AM Sunil K Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> [...]
>    asm ("bsfq %2,%0\n"	/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"	/* If number was zero, use -1 as result.  */
> +       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */
>         : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));

Note, there is technically a "bug" in this code in that we clobber "cc"
but don't say so.  Might as well fix that.
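For illustration, the patched statement with the explicit clobber Noah suggests might look like the sketch below (untested; as the next reply points out, GCC on x86 adds this clobber implicitly anyway):

```c
int
ffsll (long long int x)
{
  long long int cnt;
  long long int tmp;

  asm ("bsfq %2,%0\n"		/* Count low bits in X.  */
       "cmove %k1,%k0\n"	/* If X was zero, take the -1 from tmp.  */
       : "=&r" (cnt), "=r" (tmp)
       : "rm" (x), "1" (-1)
       : "cc");			/* BSF writes ZF, so flags are modified.  */

  /* For x == 0, cnt holds 0xffffffff; cnt + 1 truncated to int is 0.  */
  return cnt + 1;
}
```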
On 26/07/23 13:59, Noah Goldstein via Libc-alpha wrote:
> [...]
> FWIW it should be:
> ```
> 0000000000000000 <.text>:
>    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
>    5:	48 0f bc c7          	bsf    %rdi,%rax
>    9:	ff c0                	inc    %eax
> ```
> And since it's in inline asm, there is no reason not to get that.

Can't we just use compiler builtins instead and remove a bunch of asm?
GCC already optimizes ffsl/ffsll to the builtin if the architecture
allows it, so micro-optimizing it in libc.so with arch-specific code is
really not ideal.

With GCC 13 and my patch [1] I see:

$ objdump -d ./string/ffsll.os

./string/ffsll.os:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <__ffsll>:
   0:	48 0f bc ff          	bsf    %rdi,%rdi
   4:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
   b:	48 0f 44 f8          	cmove  %rax,%rdi
   f:	8d 47 01             	lea    0x1(%rdi),%eax
  12:	c3                   	ret

[1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
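A minimal sketch of the builtin-based approach (assuming __builtin_ffsll, which GCC and Clang both provide; the linked patch may differ in detail such as symbol aliasing):

```c
/* Sketch: delegate instruction selection to the compiler.  */
int
ffsll (long long int x)
{
  return __builtin_ffsll (x);
}
```

With -O2 on x86-64, GCC 13 expands this to essentially the bsf/cmove/lea sequence shown in the objdump output above.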
On Wed, Jul 26, 2023 at 10:05 AM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> [...]
> Note, there is technically a "bug" in this code in that we clobber "cc"
> but don't say so.  Might as well fix that.

On x86, flags/cc are always considered clobbered by inline asm;
ix86_md_asm_adjust will add the clobber if there was no asm flags
output.

Thanks,
Andrew Pinski
On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [...]
> FWIW it should be:
> ```
> 0000000000000000 <.text>:
>    0:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
>    5:	48 0f bc c7          	bsf    %rdi,%rax
>    9:	ff c0                	inc    %eax
> ```
> And since it's in inline asm, there is no reason not to get that.

We shouldn't remove the cmove: per the Intel manual for the BSF
instruction, if the content of the source operand is 0, the content of
the destination operand is undefined.  Also, removing the cmove doesn't
provide any perf advantage.
On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> [...]
> We shouldn't remove the cmove: per the Intel manual for the BSF
> instruction, if the content of the source operand is 0, the content of
> the destination operand is undefined.  Also, removing the cmove doesn't
> provide any perf advantage.

We rely on that behavior in other areas (memchr-evex512, for example).
It's been confirmed that BSF is defined not to change the destination
(like AMD).

It saves an instruction.

Either way, however, I think Adhemerval is right and we should use
builtins for this.
On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [...]
> We rely on that behavior in other areas (memchr-evex512, for example).
> It's been confirmed that BSF is defined not to change the destination
> (like AMD).
>
> It saves an instruction.
>
> Either way, however, I think Adhemerval is right and we should use
> builtins for this.

A few things to notice here.

This code is more than 20 years old; it's not clear which Intel
processors it may be running on.  A builtin will have the same issue
unless it can produce code of 16 bytes or less.  Also, the purpose of
this patch is to fix a random, unexpected ~20% perf loss.
On Wed, Jul 26, 2023 at 5:38 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> [...]
> This code is more than 20 years old; it's not clear which Intel
> processors it may be running on.  A builtin will have the same issue
> unless it can produce code of 16 bytes or less.  Also, the purpose of
> this patch is to fix a random, unexpected ~20% perf loss.

Likewise for the string/memory library...  Why not just update it
without the cmov?  It just seems like a waste not to address it while
we're at it.
On Wed, Jul 26, 2023 at 1:12 PM Adhemerval Zanella Netto via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Can't we just use compiler builtins instead and remove a bunch of asm?

Yes, please.  And any better code-generation strategy is better directed
at GCC, not glibc.
* Noah Goldstein via Libc-alpha:

> Likewise for the string/memory library...  Why not just update it
> without the cmov?  It just seems like a waste not to address it while
> we're at it.

Given that it violates the spec, doing it in an inline function seems
kind of risky.

Thanks,
Florian
On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:

> * Noah Goldstein via Libc-alpha:
>
> > Likewise for the string/memory library...  Why not just update it
> > without the cmov?  It just seems like a waste not to address it while
> > we're at it.
>
> Given that it violates the spec, doing it in an inline function seems
> kind of risky.

Sorry, what inline function?  The patch seems to modify an
implementation in ffsll.c, nothing else.

Alexander
* Alexander Monakov:

> On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
>
> > Given that it violates the spec, doing it in an inline function seems
> > kind of risky.
>
> Sorry, what inline function?  The patch seems to modify an
> implementation in ffsll.c, nothing else.

Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
glibc implementation probably isn't that important on x86.

Thanks,
Florian
On Thu, Jul 27, 2023 at 8:11 AM Florian Weimer via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Yeah, sorry, I was confused.  There's a GCC built-in, right?

Yes, on all targets that have an ffs pattern; the others need libgcc.
On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:

> * Alexander Monakov:
> > [...]
>
> Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
> glibc implementation probably isn't that important on x86.

Yep.  The built-in gets expanded even at -Os, so it's quite unusual for
glibc's implementation to get called.  I see the following
possibilities:

* at -O0 (but then performance doesn't matter, presumably)
* with -fno-builtin
* when called via a function pointer

Sunil, could you clear this up, please?

Alexander
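As a concrete illustration of the third case, a hypothetical test program: the volatile-qualified pointer keeps the compiler from folding the indirect call back into the builtin, so glibc's out-of-line ffsll runs.

```c
#define _DEFAULT_SOURCE		/* ffsll is declared in <string.h>.  */
#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* The volatile pointer defeats constant propagation, forcing a real
     call into libc instead of the __builtin_ffsll expansion.  */
  int (*volatile fp) (long long int) = ffsll;
  printf ("%d\n", fp (0x1000LL));	/* bit 12 is set: prints 13 */
  return 0;
}
```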
On Thu, Jul 27, 2023 at 7:00 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> [...]
> Yep.  The built-in gets expanded even at -Os, so it's quite unusual for
> glibc's implementation to get called.  I see the following
> possibilities:
>
> * at -O0 (but then performance doesn't matter, presumably)
> * with -fno-builtin
> * when called via a function pointer
>
> Sunil, could you clear this up, please?

This issue only matters if the ffsll functionality is implemented in a
function larger than 16 bytes and the function is not inlined (it
doesn't matter whether it's implemented in C or assembly).

By default the function entry point gets aligned to a 16-byte boundary,
so the following layouts are all possible:

64-byte aligned: no issue, as a 17-byte function will not cause another
cache-line load.
48-byte aligned: ~20% regression, as a 17-byte function will trigger
another cache-line load.
32-byte aligned: no issue, as a 17-byte function will not cause another
cache-line load.
16-byte aligned: no issue, as a 17-byte function will not cause another
cache-line load.

Ffsll is one of the benchmark tests in the Phoronix test suite; it's
not clear how much it matters to applications.  But lots of people are
involved in Phoronix benchmark testing/tracking, and this kind of
random perf behavior wastes their time.

Again, I'm not against GCC, but if this function exists in glibc, I
don't see any harm in fixing it.
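The arithmetic behind that layout breakdown can be checked with a few lines of C (illustrative only; it assumes 64-byte cache lines and a 17-byte function body):

```c
#include <stdio.h>

int
main (void)
{
  /* A 17-byte function at each 16-byte-aligned offset within a 64-byte
     cache line: only the slot starting at byte 48 spills into the next
     line, since 48 + 17 = 65 > 64.  */
  for (int off = 0; off < 64; off += 16)
    printf ("entry at +%2d: %s\n", off,
	    off + 17 > 64 ? "crosses a cache line" : "fits in one line");
  return 0;
}
```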
On Thu, 27 Jul 2023, Sunil Pandey wrote:

> Ffsll is one of the benchmark tests in the Phoronix test suite; it's
> not clear how much it matters to applications.  But lots of people are
> involved in Phoronix benchmark testing/tracking, and this kind of
> random perf behavior wastes their time.

Ah, I see: PTS runs glibc benchtests (which use -fno-builtin).

> Again, I'm not against GCC, but if this function exists in glibc, I
> don't see any harm in fixing it.

Sure.  But as Richard seems to be pointing out, the function seems
needlessly obfuscated (even the first comment is wrong!).  So if Noah's
suggestion is not accepted, I can offer the following cleaned-up
variant:

int
ffsll (long long int x)
{
  int cnt;

  asm ("bsfq %1,%q0\n"		/* Count low bits in X and store in %0.  */
       "cmovel %2,%0\n"		/* If number was zero, use -1 as result.  */
       : "=&r" (cnt) : "r" (x), "r" (-1));

  return cnt + 1;
}

(Note that the initial bsfq has an undesirable false dependency on the
output operand, and Noah's variant fixes that too.)

Alexander
* Sunil Pandey:

> Ffsll is one of the benchmark tests in the Phoronix test suite; it's
> not clear how much it matters to applications.  But lots of people are
> involved in Phoronix benchmark testing/tracking, and this kind of
> random perf behavior wastes their time.

That's a good point.  I've seen similar reports before (sadly I don't
recall if they were specifically about ffsll).

Regarding the mechanics of fixing it: if the instruction ordering and
sizing is so sensitive, should this be an assembler implementation
instead?  And will the fix even work for distributions that build with
--enable-cet, considering that there's going to be an additional 4-byte
NOP at the start of the function?

Thanks,
Florian
On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
> [...]
> Regarding the mechanics of fixing it: if the instruction ordering and
> sizing is so sensitive, should this be an assembler implementation
> instead?  And will the fix even work for distributions that build with
> --enable-cet, considering that there's going to be an additional 4-byte
> NOP at the start of the function?

Sigh...  Do we really need to care about this synthetic benchmark, which
exercises a fallback path, given that the compiler will most likely
issue the inline builtin?  And even if this is really important, tuning
function alignment and size to fit in a cache line should be done by the
compiler, especially in a case where we can implement the function with
a builtin.
On Thu, Jul 27, 2023 at 9:24 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> [...]
> Regarding the mechanics of fixing it: if the instruction ordering and
> sizing is so sensitive, should this be an assembler implementation
> instead?

Instruction ordering and sizing pretty much always matter.  Many
instructions can't be used in low-latency applications because their
encoding size is big.  Unfortunately, assemblers can't do much in this
case.

> And will the fix even work for distributions that build with
> --enable-cet, considering that there's going to be an additional 4-byte
> NOP at the start of the function?

The extra 4 bytes from --enable-cet can affect the performance of many
small, highly optimized routines and can potentially create perf
variation depending on how a function gets loaded into memory.  But
again, this is one of the costs of using CET.
* Adhemerval Zanella Netto:

> [...]
> Sigh...  Do we really need to care about this synthetic benchmark, which
> exercises a fallback path, given that the compiler will most likely
> issue the inline builtin?  And even if this is really important, tuning
> function alignment and size to fit in a cache line should be done by the
> compiler, especially in a case where we can implement the function with
> a builtin.

I think avoiding wasting people's time with spurious benchmark
differences is useful.  Compared to things like the PID cache, this one
seems pretty harmless.

Maybe we should increase function alignment to cover the CET case, too?

Thanks,
Florian
On Thu, Jul 27, 2023 at 10:09 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> [...]
> I think avoiding wasting people's time with spurious benchmark
> differences is useful.  Compared to things like the PID cache, this one
> seems pretty harmless.
>
> Maybe we should increase function alignment to cover the CET case, too?

Good point.  I have seen some other cases where function alignment
creates a big perf swing.  I think aligning function entry points to the
cache-line size will fix most alignment-related issues.  The trade-off
of bigger alignment is increased memory usage.
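A hypothetical sketch of that idea, using GCC's aligned attribute on a single function (the function name is illustrative; a whole build could use -falign-functions=64 instead):

```c
/* Hypothetical: pin the entry point to a 64-byte cache-line boundary so
   a ~17-byte body never straddles two lines.  The cost is padding
   (memory) between functions.  */
__attribute__ ((aligned (64))) int
my_ffsll (long long int x)
{
  return __builtin_ffsll (x);
}
```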
diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
index a1c13d4906..dbded6f0a1 100644
--- a/sysdeps/x86_64/ffsll.c
+++ b/sysdeps/x86_64/ffsll.c
@@ -29,7 +29,7 @@ ffsll (long long int x)
   long long int tmp;
 
   asm ("bsfq %2,%0\n"	/* Count low bits in X and store in %1.  */
-       "cmoveq %1,%0\n"	/* If number was zero, use -1 as result.  */
+       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */
        : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
 
   return cnt + 1;