Message ID | 20230310024420.521941-1-goldstein.w.n@gmail.com |
---|---|
State | New |
Series | [v1] x86-64: Replace `%ah` write with `%eax` read |
On Thu, Mar 9, 2023 at 6:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> High8 partial registers can incur a stall when being modified (if not
> renamed separately), or at the very least incur extra backend uops (if
> renamed separately). Either way `testl $0x0400, %eax` is preferable to
> `andb $0x04, %ah`.
>
> Function size is unchanged when accounting for 16-byte padding.
> ---
>  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
> index d754668bce..d45f984e1a 100644
> --- a/sysdeps/x86_64/fpu/e_fmodl.S
> +++ b/sysdeps/x86_64/fpu/e_fmodl.S
> @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
>  	fldt	8(%rsp)
>  1:	fprem
>  	fstsw	%ax
> -	and	$04,%ah
> +	testl	$0x400,%eax
>  	jnz	1b
>  	fstp	%st(1)
>  	ret
> --
> 2.34.1
>

OK.

Thanks.
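The two instructions in the commit message test the same bit: `fstsw %ax` stores the 16-bit x87 status word, and bit 10 of that word (the C2 flag, which `fprem` leaves set while the reduction is incomplete) is bit 2 of `%ah`. The following is a minimal Python sketch of that equivalence, not part of the patch; the function names are hypothetical and it models only the mask arithmetic, not flags or timing.

```python
# Illustrative model only: names below are hypothetical, not from glibc.
C2_MASK_WORD = 0x0400  # bit 10 of the 16-bit x87 status word
C2_MASK_AH = 0x04      # the same bit, seen from the high byte (%ah)


def c2_via_eax(status_word: int) -> bool:
    """Mimic `testl $0x400, %eax`: mask the full value, modify nothing."""
    return (status_word & C2_MASK_WORD) != 0


def c2_via_ah(status_word: int) -> bool:
    """Mimic `andb $0x04, %ah`: take bits 8..15, then apply the 8-bit mask."""
    ah = (status_word >> 8) & 0xFF
    return (ah & C2_MASK_AH) != 0


def masks_agree() -> bool:
    """Check every possible 16-bit status word."""
    return all(c2_via_eax(sw) == c2_via_ah(sw) for sw in range(0x10000))
```

Because both forms agree for every status word, the `jnz 1b` that follows loops under exactly the same condition before and after the patch.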
* Noah Goldstein via Libc-alpha:

> High8 partial registers can incur a stall when being modified (if not
> renamed separately), or at the very least incur extra backend uops (if
> renamed separately). Either way `testl $0x0400, %eax` is preferable to
> `andb $0x04, %ah`.
>
> Function size is unchanged when accounting for 16-byte padding.
> ---
>  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
> index d754668bce..d45f984e1a 100644
> --- a/sysdeps/x86_64/fpu/e_fmodl.S
> +++ b/sysdeps/x86_64/fpu/e_fmodl.S
> @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
>  	fldt	8(%rsp)
>  1:	fprem
>  	fstsw	%ax
> -	and	$04,%ah
> +	testl	$0x400,%eax

Why not test $0x400,%ax or test $04,%ah?

Thanks,
Florian
On Mon, Mar 13, 2023 at 3:03 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein via Libc-alpha:
>
> >  	fstsw	%ax
> > -	and	$04,%ah
> > +	testl	$0x400,%eax
>
> Why not test $0x400,%ax or test $04,%ah?

`test $0x400,%ax` uses an imm16, which can cause length-changing-prefix
(LCP) stalls because of the `0x66` operand-size prefix.
`test $0x4,%ah` is more okay, but partial-register usage has several
delays associated with it (even for pure reads). It depends on the arch,
but for example Haswell/Skylake add 2 cycles of latency (in this case,
where %ah is not being renamed separately).
In general, if you don't need the code size, it is best to stick with
32/64-bit instructions.

>
> Thanks,
> Florian
>
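To make the size and prefix trade-off above concrete, here is a small Python table of the four candidate encodings. The byte sequences are hand-assembled from the x86 opcode map as an assumption and were not taken from the thread; they should be confirmed with `objdump -d` on a real build. The helper names are hypothetical.

```python
# Hand-assembled encodings (an assumption; verify with an assembler).
ENCODINGS = {
    "and $0x04, %ah": bytes([0x80, 0xE4, 0x04]),                 # 80 /4 ib
    "test $0x04, %ah": bytes([0xF6, 0xC4, 0x04]),                # F6 /0 ib
    "test $0x400, %ax": bytes([0x66, 0xA9, 0x00, 0x04]),         # 66 A9 iw
    "testl $0x400, %eax": bytes([0xA9, 0x00, 0x04, 0x00, 0x00]),  # A9 id
}


def has_operand_size_prefix(code: bytes) -> bool:
    """True if the encoding starts with the 0x66 operand-size prefix."""
    return code[0] == 0x66


def summarize() -> dict:
    """Map each mnemonic to (length in bytes, starts with 0x66 prefix)."""
    return {
        name: (len(code), has_operand_size_prefix(code))
        for name, code in ENCODINGS.items()
    }
```

Under these assumed encodings only the 16-bit form carries the `0x66` prefix in front of an immediate whose length it changes, and the 32-bit form is two bytes longer than the `%ah` forms, which fits the commit message's remark that the function size is unchanged once 16-byte padding is accounted for.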
* Noah Goldstein:

> On Mon, Mar 13, 2023 at 3:03 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Noah Goldstein via Libc-alpha:
>>
>> >  	fstsw	%ax
>> > -	and	$04,%ah
>> > +	testl	$0x400,%eax
>>
>> Why not test $0x400,%ax or test $04,%ah?
> `test $0x400,%ax` uses an imm16, which can cause length-changing-prefix
> (LCP) stalls because of the `0x66` operand-size prefix.
> `test $0x4,%ah` is more okay, but partial-register usage has several
> delays associated with it (even for pure reads). It depends on the arch,
> but for example Haswell/Skylake add 2 cycles of latency (in this case,
> where %ah is not being renamed separately).
> In general, if you don't need the code size, it is best to stick with
> 32/64-bit instructions.

Do we need to clear %eax first to avoid a false dependency?

Thanks,
Florian
On Mon, Mar 13, 2023 at 12:30 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein:
>
> > `test $0x4,%ah` is more okay, but partial register usage has several
> > delays associated with it (even pure reads)
> > In general, if you don't need the code size, best to stick with
> > 32/64-bit instructions.
>
> Do we need to clear %eax first to avoid a false dependency?

Oh yeah, guess you're right, probably `test %ah` is best.

>
> Thanks,
> Florian
>
diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
index d754668bce..d45f984e1a 100644
--- a/sysdeps/x86_64/fpu/e_fmodl.S
+++ b/sysdeps/x86_64/fpu/e_fmodl.S
@@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
 	fldt	8(%rsp)
 1:	fprem
 	fstsw	%ax
-	and	$04,%ah
+	testl	$0x400,%eax
 	jnz	1b
 	fstp	%st(1)
 	ret