Message ID: 20150208110426.GA28729@domone
State: New
On Sun, 8 Feb 2015, Ondřej Bílka wrote:

> Hi, as in the bugzilla entry, what is the rationale for using char as
> int_fast8_t?
>
> It is definitely slower with division; the following code is 25% slower on
> Haswell with char than when you use long.

It may boil down to the choice of instructions produced by the compiler.
I can hardly imagine 8-bit division being slower than 64-bit division on a
processor that implements subword integer arithmetic.

> There is the question of what about other architectures and atomic
> operations; are byte ones better than int?
>
> int main ()
> {
>   int i;
>   char x = 32;
>   for (i = 0; i < 1000000000; i++)
>     x = 11 * x + 5 + x / 3;
>   return x;
> }

On Intel Atom, for example, division latencies are as follows[1]:

                 latency  throughput
  IDIV r/m8         33        32
  IDIV r/m16        42        41
  IDIV r/m32        57        56
  IDIV r/m64       197       196

I'd expect the ratio of elapsed times for the corresponding data widths to
be similar with a manual division algorithm on processors that have no
hardware divider.  There is no latency difference between the individual
data widths, AFAICT, for multiplication or general ALU operations.

For processors that do have a hardware divider implementing word
calculation only, I'd expect either a constant latency or, again, a
decrease in operation time depending on the actual width of significant
data contained in the operands.

For example, the M14Kc MIPS32 processor has an RTL configuration option to
include either an area-efficient or a high-performance MDU
(multiply/divide unit).  The area-efficient MDU has a latency of 33 clocks
for unsigned division (signed division adds up to 2 clocks for sign
reversal).  The high-performance MDU reduces the latency as follows[2]:

"Divide operations are implemented with a simple 1-bit-per-clock iterative
algorithm.  An early-in detection checks the sign extension of the
dividend (rs) operand.  If rs is 8 bits wide, 23 iterations are skipped.
For a 16-bit-wide rs, 15 iterations are skipped, and for a 24-bit-wide rs,
7 iterations are skipped.  Any attempt to issue a subsequent MDU
instruction while a divide is still active causes an IU pipeline stall
until the divide operation has completed."

As this happens automatically, there is no benefit from using a narrower
data type, and the lack of subword arithmetic operations means that using
such a type will require a truncation operation from time to time for
multiplication or general ALU operations.

> --- a/sysdeps/generic/stdint.h
> +++ b/sysdeps/generic/stdint.h
> @@ -87,12 +87,13 @@ typedef unsigned long long int uint_least64_t;
>  /* Fast types.  */
>
>  /* Signed.  */
> -typedef signed char     int_fast8_t;
>  #if __WORDSIZE == 64
> +typedef long int        int_fast8_t;
>  typedef long int        int_fast16_t;
>  typedef long int        int_fast32_t;
>  typedef long int        int_fast64_t;
>  #else
> +typedef int             int_fast8_t;
>  typedef int             int_fast16_t;
>  typedef int             int_fast32_t;
>  __extension__

So I find the choice of types above to be already questionable for a
generic header.  By default I'd expect fast data types to have the same
width as their fixed-width counterparts: the large benefit that gives with
most architectures that do implement subword arithmetic outweighs the
small loss it will likely incur with architectures that only implement
word arithmetic.  Individual ports could then override the defaults as
they see fit.

At this point, however, I believe the discussion is moot -- there will
already be relocatable objects out there with data embedded using these
types, so the ABI has been set and I don't see a way of changing it
without breaking binary compatibility.

References:

[1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
    Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
    "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21

[2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
    Technologies, Inc., Document Number: MD00672, November 2, 2009,
    Subsection "High-Performance MDU", p. 6

  Maciej
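For reference, here is a minimal benchmark sketch of the measurement Ondřej
describes, timing the same recurrence with a char and a long accumulator.
This is not code from the thread: the function names, the macro, and the
timing scaffolding are invented here, and the quoted 25% figure is specific
to Haswell and this particular loop, so results will vary by CPU, compiler,
and options (something like "gcc -O2 bench.c" is assumed).

#include <stdio.h>
#include <time.h>

#define ITERS 1000000000L

/* Same loop body for both types, so only the accumulator width differs.  */
#define DEFINE_BENCH(NAME, TYPE)                \
  static TYPE NAME (void)                       \
  {                                             \
    TYPE x = 32;                                \
    for (long i = 0; i < ITERS; i++)            \
      x = 11 * x + 5 + x / 3;                   \
    return x;                                   \
  }

DEFINE_BENCH (bench_char, char)
DEFINE_BENCH (bench_long, long)

static double
elapsed (struct timespec a, struct timespec b)
{
  return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main (void)
{
  struct timespec t0, t1, t2;
  volatile long sink;   /* keep the calls from being optimized away */

  clock_gettime (CLOCK_MONOTONIC, &t0);
  sink = bench_char ();
  clock_gettime (CLOCK_MONOTONIC, &t1);
  sink = bench_long ();
  clock_gettime (CLOCK_MONOTONIC, &t2);

  printf ("char: %.2fs\nlong: %.2fs\n", elapsed (t0, t1), elapsed (t1, t2));
  return 0;
}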
On Mon, Feb 9, 2015 at 5:36 AM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
>
>> Hi, as in the bugzilla entry, what is the rationale for using char as
>> int_fast8_t?
[...]
> References:
>
> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21

This manual is very old, and the Intel Atom microarchitecture has been
replaced by the Silvermont microarchitecture.

> [2] "MIPS32 M14Kc Processor Core Datasheet", Revision 01.00, MIPS
>     Technologies, Inc., Document Number: MD00672, November 2, 2009,
>     Subsection "High-Performance MDU", p. 6
>
>   Maciej
On Mon, 9 Feb 2015, H.J. Lu wrote:

>> [1] "Intel 64 and IA-32 Architectures Optimization Reference Manual",
>>     Intel Corporation, Order Number: 248966-020, November 2009, Table 12-2
>>     "Intel Atom Microarchitecture Instructions Latency Data", p. 12-21
>
> This manual is very old, and the Intel Atom microarchitecture has been
> replaced by the Silvermont microarchitecture.

Maybe; I picked whatever I had handy.  That doesn't change the fact that
this silicon is out there.  There is also older x86 silicon in the field
whose divider has a similar characteristic.

As I say, for a generic header you need to weigh the balance -- except
that the ABI has been set, so there's little if anything we can do, apart
from new architecture additions, which can get their overrides right from
the beginning.

  Maciej
On Sun, 8 Feb 2015, Ondřej Bílka wrote:

> Hi, as in the bugzilla entry, what is the rationale for using char as
> int_fast8_t?
>
> It is definitely slower with division; the following code is 25% slower on
> Haswell with char than when you use long.

Well, there are trade-offs with space, in that if you use more cache for a
larger type then that could make things slower; it's not automatically the
case that a type that's faster for individual operations makes programs
faster when it's used.

In any case, we need to be very wary of changing the size or alignment or
C++ name mangling of any exported type in a public header, even if it
doesn't affect existing executables / shared libraries because the type
isn't involved in the ABIs of existing glibc functions.

In the present case, GCC knows about these types for each target, both so
as to ensure <stdint.h> is available for freestanding compilations, and to
provide the Fortran C bindings.  So if you did use different types for new
glibc configurations, the GCC ports for those targets would also need to
know.

This patch doesn't change inttypes.h.  If it passed testing, I suppose
that means we are missing a testcase that all the inttypes.h format macros
work with the corresponding types (such a test would mainly be that the
testcase builds OK with -Wformat -Werror rather than a test of runtime
execution).
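The following is a hypothetical sketch of the kind of compile-only
testcase Joseph describes: pair the <stdint.h> fast types with their
<inttypes.h> format macros and build with -Wformat -Werror.  The macros
and types used are standard C99; the file structure and variable names are
invented here, and this is not an actual glibc test.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  int_fast8_t f8 = 0;
  int_fast16_t f16 = 0;
  int_fast32_t f32 = 0;
  int_fast64_t f64 = 0;
  uint_fast8_t uf8 = 0;

  /* Each conversion specification must agree with the corresponding
     argument type; with -Wformat -Werror a mismatch between, say,
     PRIdFAST8 and int_fast8_t fails the build.  */
  printf ("%" PRIdFAST8 " %" PRIdFAST16 " %" PRIdFAST32 " %" PRIdFAST64
          " %" PRIuFAST8 "\n", f8, f16, f32, f64, uf8);
  return 0;
}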
On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:

> Hi, as in the bugzilla entry, what is the rationale for using char as
> int_fast8_t?
>
> It is definitely slower with division; the following code is 25% slower on
> Haswell with char than when you use long.

This claim is nonsense.  It's a compiler bug.  If the 8-bit divide
instruction is slow, then the compiler should use 32-bit or 64-bit divide
instructions to divide 8-bit types.  (Note: there's actually no such thing
as a division of 8-bit types; formally, they're promoted to int, so it's
the compiler being stupid if it generates a slow 8-bit divide instruction
for operands that are formally int!)  There's no reason to use a different
type for the _storage_.

The only legitimate reason to change the int_fast types is when the
_storage_ in certain sizes is slow to access.

Rich
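A minimal sketch of the promotion rule Rich is relying on (the function
name is invented, not from the thread): both operands of the division
below undergo the integer promotions, so the division is formally an int
division and a compiler may legally evaluate it with a 32-bit divide,
narrowing only when the result is converted back to the 8-bit type.

#include <stdint.h>

int_fast8_t
step (int_fast8_t x)
{
  /* 11 * x, x / 3 and the additions are all computed in (at least) int;
     only the conversion to the return type narrows the value.  */
  return 11 * x + 5 + x / 3;
}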
On 09/02/15 13:36, Maciej W. Rozycki wrote:
> On Sun, 8 Feb 2015, Ondřej Bílka wrote:
>
>> Hi, as in the bugzilla entry, what is the rationale for using char as
>> int_fast8_t?
>>
>> It is definitely slower with division; the following code is 25% slower on
>> Haswell with char than when you use long.
[...]
>> --- a/sysdeps/generic/stdint.h
>> +++ b/sysdeps/generic/stdint.h
>> @@ -87,12 +87,13 @@ typedef unsigned long long int uint_least64_t;
>>  /* Fast types.  */
>>
>>  /* Signed.  */
>> -typedef signed char     int_fast8_t;
>>  #if __WORDSIZE == 64
>> +typedef long int        int_fast8_t;
>>  typedef long int        int_fast16_t;
>>  typedef long int        int_fast32_t;
>>  typedef long int        int_fast64_t;
>>  #else
>> +typedef int             int_fast8_t;
>>  typedef int             int_fast16_t;
>>  typedef int             int_fast32_t;
>>  __extension__

On AArch64 there's nothing to be gained in terms of performance from using
a 64-bit type over a 32-bit type when both can hold the required range of
values.  In fact, it's likely to make things slower, since multiply and
divide operations will most likely take longer.  So on AArch64,
int_fast8_t, int_fast16_t and int_fast32_t should all map to int, not
long.

R.

> So I find the choice of types above to be already questionable for a
> generic header.
[...]
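A hypothetical sketch of what the per-port override Maciej mentions, with
the AArch64-style mapping Richard asks for, could look like.  The typedef
style mirrors sysdeps/generic/stdint.h, but this is not an actual port
header, and whether any port can still change these given the ABI concerns
raised above is exactly what the thread disputes.

/* Hypothetical per-port fast types for an LP64 target where word-sized
   (32-bit) arithmetic is as fast as 64-bit: only int_fast64_t stays
   64-bit.  Not meant to be compiled alongside <stdint.h>.  */
typedef int                 int_fast8_t;
typedef int                 int_fast16_t;
typedef int                 int_fast32_t;
typedef long int            int_fast64_t;

typedef unsigned int        uint_fast8_t;
typedef unsigned int        uint_fast16_t;
typedef unsigned int        uint_fast32_t;
typedef unsigned long int   uint_fast64_t;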
On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
> On Sun, Feb 08, 2015 at 12:04:26PM +0100, Ondřej Bílka wrote:
>> Hi, as in the bugzilla entry, what is the rationale for using char as
>> int_fast8_t?
>>
>> It is definitely slower with division; the following code is 25% slower on
>> Haswell with char than when you use long.
>
> This claim is nonsense.  It's a compiler bug.  If the 8-bit divide
> instruction is slow, then the compiler should use 32-bit or 64-bit divide
> instructions to divide 8-bit types.  (Note: there's actually no such thing
> as a division of 8-bit types; formally, they're promoted to int, so it's
> the compiler being stupid if it generates a slow 8-bit divide instruction
> for operands that are formally int!)  There's no reason to use a different
> type for the _storage_.

That is also nonsense; you cannot get the same speed as the 32-bit
instruction without having an 8-bit instruction with the same performance.

The compiler must add extra truncation instructions to get a correct
result, which slows it down; otherwise it gets a wrong result for cases
like (128+128)%3.
On Wed, Feb 11, 2015 at 02:34:09PM +0100, Ondřej Bílka wrote:
> On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
[...]
>> There's no reason to use a different type for the _storage_.
>
> That is also nonsense; you cannot get the same speed as the 32-bit
> instruction without having an 8-bit instruction with the same performance.
>
> The compiler must add extra truncation instructions to get a correct
> result, which slows it down; otherwise it gets a wrong result for cases
> like (128+128)%3.

The additional need to truncate can arise when using 32-bit types on a
64-bit arch, but normally you defer this truncation until storage (in
which case it's usually automatic), promotion, or division (in which case
it's hopefully dominated by the time to divide).  If your concern is the
need to truncate before division/remainder operations, I think that cost
is dwarfed by the added cache/memory pressure of wasting half your space.

Rich
On 11/02/15 13:34, Ondřej Bílka wrote:
> On Mon, Feb 09, 2015 at 01:13:24PM -0500, Rich Felker wrote:
[...]
>> There's no reason to use a different type for the _storage_.
>
> That is also nonsense; you cannot get the same speed as the 32-bit
> instruction without having an 8-bit instruction with the same performance.
>
> The compiler must add extra truncation instructions to get a correct
> result, which slows it down; otherwise it gets a wrong result for cases
> like (128+128)%3.

Only if the intermediate result (128+128) is assigned directly to a
variable with less precision than int.  Otherwise the whole expression is
calculated with int precision.

R.
On 11/02/15 18:26, Richard Earnshaw wrote:
> On 11/02/15 13:34, Ondřej Bílka wrote:
[...]
>> The compiler must add extra truncation instructions to get a correct
>> result, which slows it down; otherwise it gets a wrong result for cases
>> like (128+128)%3.
>
> Only if the intermediate result (128+128) is assigned directly to a
> variable with less precision than int.

Or, of course, if an explicit cast is applied to the intermediate result.

> Otherwise the whole expression is calculated with int precision.
>
> R.
On Wed, Feb 11, 2015 at 06:26:01PM +0000, Richard Earnshaw wrote:
> On 11/02/15 13:34, Ondřej Bílka wrote:
[...]
>> The compiler must add extra truncation instructions to get a correct
>> result, which slows it down; otherwise it gets a wrong result for cases
>> like (128+128)%3.
>
> Only if the intermediate result (128+128) is assigned directly to a
> variable with less precision than int.  Otherwise the whole expression is
> calculated with int precision.

I read it as an example with small numbers that's not directly applicable
to the C type ranges.  If we're actually talking about the value 128 in C,
then of course 128+128 is not subject to truncation.

Rich
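An illustrative sketch of the truncation point being argued over, not
taken from the thread.  It uses in-range constants (100+100 rather than
128+128, so both operands fit in a signed char): when the intermediate is
stored into an 8-bit object it wraps before the remainder is taken,
whereas the whole expression computed in int needs no narrowing at all.

#include <stdio.h>

int
main (void)
{
  signed char a = 100, b = 100;

  /* Intermediate stored into a narrow object: 200 does not fit, and on
     the common 8-bit two's-complement implementation it wraps to -56
     (this conversion is implementation-defined), so the remainder is
     computed from -56.  A word-sized machine must emit an explicit
     narrowing here to match.  */
  signed char stored = a + b;
  int r1 = stored % 3;                 /* -56 % 3 == -2 */

  /* Whole expression evaluated in int: no narrowing anywhere.  */
  int r2 = (a + b) % 3;                /* 200 % 3 == 2 */

  printf ("stored: %d  unstored: %d\n", r1, r2);
  return 0;
}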
diff --git a/sysdeps/generic/stdint.h b/sysdeps/generic/stdint.h
index 5c9f0ff..c6dda3b 100644
--- a/sysdeps/generic/stdint.h
+++ b/sysdeps/generic/stdint.h
@@ -87,12 +87,13 @@ typedef unsigned long long int uint_least64_t;
 /* Fast types.  */
 
 /* Signed.  */
-typedef signed char     int_fast8_t;
 #if __WORDSIZE == 64
+typedef long int        int_fast8_t;
 typedef long int        int_fast16_t;
 typedef long int        int_fast32_t;
 typedef long int        int_fast64_t;
 #else
+typedef int             int_fast8_t;
 typedef int             int_fast16_t;
 typedef int             int_fast32_t;
 __extension__