Message ID: 1420799253-21911-1-git-send-email-frediano.ziglio@huawei.com
State: New
On 09/01/2015 11:27, Frediano Ziglio wrote:
>
> Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
> ---
>  include/qemu-common.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index f862214..5366220 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
>  }
>
>  /* compute with 96 bit intermediate result: (a*b)/c */
> +#ifndef __x86_64__
>  static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>  {
>      union {
> @@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
>      res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
>      return res.ll;
>  }
> +#else
> +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
> +{
> +    uint64_t res;
> +
> +    asm ("mulq %2\n\tdivq %3"
> +         : "=a"(res)
> +         : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
> +         : "rdx", "cc");
> +    return res;
> +}
> +#endif

Good idea.  However, if you have __int128, you can just do

    return (__int128)a * b / c

and the compiler should generate the right code.  Conveniently, there
is already CONFIG_INT128 that you can use.

Paolo
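For reference, a minimal sketch of the CONFIG_INT128 variant suggested above — an illustration under the assumption that configure defines CONFIG_INT128 whenever the compiler supports __int128, not the patch that was eventually applied:

    /* Sketch only: assumes CONFIG_INT128 is set by configure when the
     * compiler supports the __int128 type. */
    #ifdef CONFIG_INT128
    static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
    {
        /* a is at most 2^64 - 1 and b at most 2^32 - 1, so a * b fits
         * in 96 bits and the 128-bit intermediate cannot overflow. */
        return (__int128)a * b / c;
    }
    #endif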
2015-01-09 10:35 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
>
> On 09/01/2015 11:27, Frediano Ziglio wrote:
>> [...]
>
> Good idea.  However, if you have __int128, you can just do
>
>     return (__int128)a * b / c
>
> and the compiler should generate the right code.  Conveniently, there
> is already CONFIG_INT128 that you can use.

Well, it works, but in our case b <= c, so a * b / c <= a and the
result is always < 2^64.  That means no integer overflow in the final
division.  However, the compiler does not know this, so it performs
the full (a*b) / c division, which mainly consists of two integer
divisions instead of one (not counting that it is implemented through
a helper function).

I think I'll write two patches: one implementing it with __int128 as
you suggested (which is much easier to read than both the current
version and the assembly one), and another for the x86_64
optimization.

Frediano
On 09/01/2015 12:04, Frediano Ziglio wrote:
> 2015-01-09 10:35 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
>> [...]
>> Good idea.  However, if you have __int128, you can just do
>>
>>     return (__int128)a * b / c
>>
>> and the compiler should generate the right code.  Conveniently, there
>> is already CONFIG_INT128 that you can use.
>
> Well, it works, but in our case b <= c, so a * b / c <= a and the
> result is always < 2^64.

This is not necessarily the case.  Quick grep:

hw/timer/hpet.c:        return (muldiv64(value, HPET_CLK_PERIOD, FS_PER_NS));
hw/timer/hpet.c:        return (muldiv64(value, FS_PER_NS, HPET_CLK_PERIOD));

One of the two must disprove your assertion. :)

But it's true that we expect no overflow.

> That means no integer overflow in the final division.  However, the
> compiler does not know this, so it performs the full (a*b) / c
> division, which mainly consists of two integer divisions instead of
> one (not counting that it is implemented through a helper function).
>
> I think I'll write two patches: one implementing it with __int128 as
> you suggested (which is much easier to read than both the current
> version and the assembly one), and another for the x86_64
> optimization.

Right, that's even better.

Out of curiosity, have you seen it in some profiles?

Paolo
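As an aside, the "we expect no overflow" contract above could be made explicit in debug builds.  A hypothetical sketch — muldiv64_checked is an invented name, not anything present in QEMU:

    #include <assert.h>
    #include <stdint.h>

    #ifdef CONFIG_INT128
    /* Hypothetical debug helper: compute the quotient in 128 bits and
     * assert that it fits in 64 bits before truncating. */
    static inline uint64_t muldiv64_checked(uint64_t a, uint32_t b, uint32_t c)
    {
        unsigned __int128 res = (unsigned __int128)a * b / c;
        assert(res <= UINT64_MAX);
        return (uint64_t)res;
    }
    #endif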
On 9 January 2015 at 11:24, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 09/01/2015 12:04, Frediano Ziglio wrote:
>> I think I'll write two patches: one implementing it with __int128 as
>> you suggested (which is much easier to read than both the current
>> version and the assembly one), and another for the x86_64
>> optimization.
>
> Right, that's even better.

Personally I would prefer we didn't write inline assembly functions
if we can avoid them.  So I'd rather see an int128 version, and if
the compiler doesn't do a good enough job then go talk to the
compiler folks to improve things.

> Out of curiosity, have you seen it in some profiles?

I would absolutely want to see a significant perf uplift on a real
workload before we start putting inline asm into qemu-common.h...

-- PMM
2015-01-09 11:24 GMT+00:00 Paolo Bonzini <pbonzini@redhat.com>:
> [...]
> This is not necessarily the case.  Quick grep:
>
> hw/timer/hpet.c:        return (muldiv64(value, HPET_CLK_PERIOD, FS_PER_NS));
> hw/timer/hpet.c:        return (muldiv64(value, FS_PER_NS, HPET_CLK_PERIOD));
>
> One of the two must disprove your assertion. :)

Unless FS_PER_NS == HPET_CLK_PERIOD :)

> But it's true that we expect no overflow.

This is enough!

> Right, that's even better.
>
> Out of curiosity, have you seen it in some profiles?

No, I just looked at the complicated code and the generated code and
thought "why use a dozen instructions when two are enough?" :)

Frediano
diff --git a/include/qemu-common.h b/include/qemu-common.h
index f862214..5366220 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -370,6 +370,7 @@ static inline uint8_t from_bcd(uint8_t val)
 }
 
 /* compute with 96 bit intermediate result: (a*b)/c */
+#ifndef __x86_64__
 static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
 {
     union {
@@ -392,6 +393,18 @@ static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
     res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
     return res.ll;
 }
+#else
+static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
+{
+    uint64_t res;
+
+    asm ("mulq %2\n\tdivq %3"
+         : "=a"(res)
+         : "a"(a), "qm"((uint64_t) b), "qm"((uint64_t)c)
+         : "rdx", "cc");
+    return res;
+}
+#endif
 
 /* Round number down to multiple */
 #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
As this platform can do multiply/divide using 128-bit precision, use
these instructions to implement it.

Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com>
---
 include/qemu-common.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)
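For context, the typical muldiv64 call pattern is clock-frequency scaling in timer code.  Below is a self-contained, hypothetical usage sketch: the 25 MHz tick rate is a made-up value, and the __int128 body merely stands in for whichever qemu-common.h variant gets selected:

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for the qemu-common.h helper, using the __int128 form
     * discussed in the thread; requires a compiler with __int128. */
    static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
    {
        return (unsigned __int128)a * b / c;
    }

    int main(void)
    {
        uint64_t ns = 123456789;    /* elapsed time in nanoseconds */
        /* Convert nanoseconds to ticks of a made-up 25 MHz counter:
         * ticks = ns * 25000000 / 1000000000.  The intermediate
         * product (roughly 2^51 here, up to 2^96 in general) could
         * overflow a plain 64-bit multiply. */
        uint64_t ticks = muldiv64(ns, 25000000u, 1000000000u);
        printf("%llu ns -> %llu ticks\n",
               (unsigned long long)ns, (unsigned long long)ticks);
        return 0;
    }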