Message ID: 20171215042058.23705-1-woodard@redhat.com
State: New
Series: New numbers in the benchtests.
On 12/14/2017 08:20 PM, Ben Woodard wrote:
> A customer who has asked not to be named has had some problems with
> performance for some inputs of certain libm functions. We felt that a
> good idea would be to capture and preserve these test cases so that we
> could track and measure the performance for these cases over time.
> These values evidently came from analyzing performance problems with
> actual running codes. The intent is that more test cases will be added
> over time.
>
> The fast and the slow cases are separated in the test results so that
> we could not only track any improvements in the worst case performance
> but also detect any degradation in the performance of the optimal
> cases should we change the algorithm or split the range differently.

This looks good to me. I've long suggested that hardware and software
vendors should contribute important numbers upstream so we can see which
classes of users are impacted. I'm happy to see this kind of thing go
upstream. Some might argue these numbers should stay downstream in RHEL,
but I think that does a disservice. With my volunteer hat on I'd like to
know how my changes impact all sorts of users.
> Here are the results of that part of the modified tests:
>
>  "exp": {
>   "": {
>    "duration": 3.10837e+10,
>    "iterations": 5.654e+06,
>    "max": 37073.6,
>    "min": 61.904,
>    "mean": 5497.65
>   },
>   "144bits": {
>    "duration": 2.9752e+10,
>    "iterations": 1.83e+06,
>    "max": 36458.4,
>    "min": 159.793,
>    "mean": 16257.9
>   },
>   "768bits": {
>    "duration": 2.98869e+10,
>    "iterations": 105000,
>    "max": 307814,
>    "min": 277808,
>    "mean": 284637
>   },
>   "redhat-slow-customer-cases": {
>    "duration": 2.90641e+10,
>    "iterations": 241000,
>    "max": 173768,
>    "min": 118393,
>    "mean": 120598
>   },
>   "redhat-fast-customer-cases": {
>    "duration": 2.89308e+10,
>    "iterations": 1.94562e+08,
>    "max": 285.431,
>    "min": 143.328,
>    "mean": 148.697
>   }
>  },
>
>  "pow": {
>   "": {
>    "duration": 2.91114e+10,
>    "iterations": 5.9899e+07,
>    "max": 1340.38,
>    "min": 149.229,
>    "mean": 486.009
>   },
>   "240bits": {
>    "duration": 3.76374e+10,
>    "iterations": 400000,
>    "max": 110951,
>    "min": 76769.9,
>    "mean": 94093.4
>   },
>   "768bits": {
>    "duration": 1.54488e+11,
>    "iterations": 101000,
>    "max": 1.60541e+06,
>    "min": 742756,
>    "mean": 1.52959e+06
>   },
>   "redhat-slow-customer-cases": {
>    "duration": 3.02998e+10,
>    "iterations": 42000,
>    "max": 790097,
>    "min": 692942,
>    "mean": 721425
>   },
>   "redhat-fast-customer-cases": {
>    "duration": 2.89989e+10,
>    "iterations": 4.8063e+07,
>    "max": 1339.97,
>    "min": 564.295,
>    "mean": 603.352
>   }
>  },
>
> Signed-off-by: Ben Woodard <woodard@redhat.com>

Modulo the nits below, you get my Reviewed-by.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> ---
>  benchtests/exp-inputs |  9 +++++++++
>  benchtests/pow-inputs | 12 ++++++++++++
>  2 files changed, 21 insertions(+)
>
> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
> index aff3fb42f1..b94e19c4f5 100644
> --- a/benchtests/exp-inputs
> +++ b/benchtests/exp-inputs
> @@ -587,3 +587,12 @@
>  0x1.0000015853da7p0
>  0x1.0000098e5e007p0
>  0x1.0000099a1ac59p0
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com

Please use standard '<>' bracing to make this easier to parse in the
future, e.g.:

Contact: Ben Woodard <woodard@redhat.com>

> +## name: redhat-slow-customer-cases
> +0x1.0p-53
> +## name: redhat-fast-customer-cases
> +0x1.0p-52
> +0x1.0p+0
> +0x1.999999999999Ap-4
> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
> index 78f8ac73d5..cea50edea7 100644
> --- a/benchtests/pow-inputs
> +++ b/benchtests/pow-inputs
> @@ -509,3 +509,15 @@
>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
> +
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com

Contact: Ben Woodard <woodard@redhat.com>

> +## name: redhat-slow-customer-cases
> +0x1.fffffffffffffp-1, 0x1.8p+0
> +0x1.ffffffffffffdp-1, 0x1.8p+0
> +0x1.ffffffffffff7p-1, 0x1.8p+0
> +## name: redhat-fast-customer-cases
> +0x1.ffffffffffffep-1, 0x1.8p+0
> +0x1.ffffffffffffcp-1, 0x1.8p+0
> +0x1.999999999999ap-4, 0x1.8p+0
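[For readers of the archive, a sketch that is not part of the patch: the
hexadecimal floating-point literals above can be inspected with Python's
float.fromhex. This shows that the slow pow bases are doubles sitting
just below 1.0 (1, 3 and 9 units of 2^-53 below it) and that one of the
fast inputs is simply the double nearest 0.1.]

```python
# Decode a few of the hex-float inputs quoted in the patch above.
slow_x = [float.fromhex(s) for s in
          ("0x1.fffffffffffffp-1",
           "0x1.ffffffffffffdp-1",
           "0x1.ffffffffffff7p-1")]

# Each slow base lies just below 1.0; print its distance below 1.0
# measured in units of 2^-53 (the spacing of doubles just under 1.0).
for x in slow_x:
    print((1.0 - x) / 2**-53)

# One of the "fast" inputs is just the double closest to 0.1.
assert float.fromhex("0x1.999999999999ap-4") == 0.1
```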
On Friday 15 December 2017 09:50 AM, Ben Woodard wrote:
> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
> index aff3fb42f1..b94e19c4f5 100644
> --- a/benchtests/exp-inputs
> +++ b/benchtests/exp-inputs
> @@ -587,3 +587,12 @@
>  0x1.0000015853da7p0
>  0x1.0000098e5e007p0
>  0x1.0000099a1ac59p0
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com
> +## name: redhat-slow-customer-cases
> +0x1.0p-53
> +## name: redhat-fast-customer-cases
> +0x1.0p-52
> +0x1.0p+0
> +0x1.999999999999Ap-4

Paul McGehearty has re-implemented exp, which will likely change the
nature of these numbers. What it means is that the slow path will be
gone for exp. I suggest re-testing with the new implementation because
we will probably just end up dropping all of these partitions and you
can then just add these inputs as is.

> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
> index 78f8ac73d5..cea50edea7 100644
> --- a/benchtests/pow-inputs
> +++ b/benchtests/pow-inputs
> @@ -509,3 +509,15 @@
>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
> +
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com
> +## name: redhat-slow-customer-cases
> +0x1.fffffffffffffp-1, 0x1.8p+0
> +0x1.ffffffffffffdp-1, 0x1.8p+0
> +0x1.ffffffffffff7p-1, 0x1.8p+0
> +## name: redhat-fast-customer-cases
> +0x1.ffffffffffffep-1, 0x1.8p+0
> +0x1.ffffffffffffcp-1, 0x1.8p+0
> +0x1.999999999999ap-4, 0x1.8p+0

Why do these inputs need to be in a different namespace? They should go
into one of the existing namespaces, i.e. 768bits, 240bits, etc. Their
performance relative to the existing namespaces should give you an
indication of where they fit in. If you want these inputs to look
distinct then you could add a comment indicating that these inputs came
in from RH; comments are just a single #.

Thanks,
Siddhesh
On Thu, 14 Dec 2017, Ben Woodard wrote:
> Here are the results of that part of the modified tests:
What do you get with the exp patch (version 8) that has been approved and
now just needs to be committed?
On 12/14/2017 11:43 PM, Siddhesh Poyarekar wrote:
> On Friday 15 December 2017 09:50 AM, Ben Woodard wrote:
>> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
>> index aff3fb42f1..b94e19c4f5 100644
>> --- a/benchtests/exp-inputs
>> +++ b/benchtests/exp-inputs
>> @@ -587,3 +587,12 @@
>>  0x1.0000015853da7p0
>>  0x1.0000098e5e007p0
>>  0x1.0000099a1ac59p0
>> +
>> +# Contributed based on customer reports
>> +# Ben Woodard woodard@redhat.com
>> +## name: redhat-slow-customer-cases
>> +0x1.0p-53
>> +## name: redhat-fast-customer-cases
>> +0x1.0p-52
>> +0x1.0p+0
>> +0x1.999999999999Ap-4
>
> Paul McGehearty has re-implemented exp which will likely change the
> nature of these numbers. What it means is that the slow path will be
> gone for exp. I suggest re-testing with the new implementation because
> we will probably just end up dropping all of these partitions and you
> can then just add these inputs as is.

That is good, and that will show up when it gets committed as a change
in the performance for these named sets.

>> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
>> index 78f8ac73d5..cea50edea7 100644
>> --- a/benchtests/pow-inputs
>> +++ b/benchtests/pow-inputs
>> @@ -509,3 +509,15 @@
>>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
>> +
>> +
>> +# Contributed based on customer reports
>> +# Ben Woodard woodard@redhat.com
>> +## name: redhat-slow-customer-cases
>> +0x1.fffffffffffffp-1, 0x1.8p+0
>> +0x1.ffffffffffffdp-1, 0x1.8p+0
>> +0x1.ffffffffffff7p-1, 0x1.8p+0
>> +## name: redhat-fast-customer-cases
>> +0x1.ffffffffffffep-1, 0x1.8p+0
>> +0x1.ffffffffffffcp-1, 0x1.8p+0
>> +0x1.999999999999ap-4, 0x1.8p+0
>
> Why do these inputs need to be in a different namespace? They should go
> into one of the existing namespaces, i.e. 768bits, 240bits, etc. Their
> performance relative to the existing namespaces should give you an
> indication where they fit in. If you want these inputs to look distinct
> then you could add a comment indicating that these inputs came in from
> RH; comments are just a single #.

This is a question of outcome and expectations.

We have customers for whom certain inputs need to be performant relative
to others. If we mix those numbers into the existing namespaces then it's
hard to see whether they got better or worse on their own, since they are
lost in the mean.

We identify them as things that Red Hat as a downstream wants to keep
working well, and we will put resources behind that.

Does that explain why we don't want to mix these with the other
namespaces?
I think this actually gets at the nature of what these benchtests
currently show, as opposed to what they could ultimately be used for.

Right now, they demonstrate specific numbers that trigger the
implementation to go into different modes. With our current
implementation, these are the fast path, the slow path and the very slow
path. So these partitions show how well the current implementation runs
when it enters those modes. You can easily see if a code change or a
compiler change improves or impacts performance. This is good for
optimizing the implementation in the traditional way.

However, there are also certain inputs to functions which are known to
be problematic, or more difficult to round correctly than others, where
you must compute the result out to N digits to get correct rounding with
a particular implementation. The quality of an implementation is in
essence how often we hit those hard cases and how quickly we can deal
with them. The benchtests that I added are more along those lines. For
example, they include exp of the last subnormal and the first normal
floating-point number. Together these represent a test case which is
known by the customer to identify quality-of-implementation issues. They
have indicated to us that this case comes up frequently enough that they
would like it to be tracked, and Carlos suggested adding it to the
benchtests.

When we change the implementation of exp, I still feel that keeping our
eye on cases like this, as well as on the hard-to-round cases, is
worthwhile. We want to make sure things like that don't blow up. That
was my intent with this.
-ben
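[A sketch for context, using Python's decimal module rather than
anything from the patch: Ben's point about hard-to-round cases can be
seen directly with the exp input 0x1.0p-53 from the slow partition.
Since exp(2^-53) = 1 + 2^-53 + 2^-107 + ..., the exact result sits a
hair above the midpoint between the neighboring doubles 1.0 and
1 + 2^-52, so an implementation needs roughly 107 significant bits to
decide the correctly rounded answer.]

```python
from decimal import Decimal, getcontext

getcontext().prec = 60           # ~200 bits, enough to see the 2^-107 term
x = float.fromhex("0x1.0p-53")   # the redhat-slow-customer-cases exp input
exact = Decimal(x).exp()         # decimal exp() is correctly rounded

# halfway point between the two candidate doubles 1.0 and 1 + 2^-52
midpoint = Decimal(1) + Decimal(2) ** -53
print(exact - midpoint)          # tiny but positive: about 2^-107

# so the correctly rounded double result is the upper neighbour
assert float(exact) == 1.0 + 2 ** -52
```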
On 12/15/2017 10:19 AM, Ben Woodard wrote:
> I think that this actually kind of gets at the nature of what these
> benchtests currently show as opposed to what they could ultimately be
> used for.
>
> Right now, they demonstrate specific numbers that trigger the
> implementation to go into different modes. With our current
> implementation, this is the fast path, the slow path and the very slow
> path. So these functions are there to show how well the current
> implementation is running when it enters those modes. You can easily
> see if a code change or a compiler change improves or impacts
> performance. This is good for optimizing the implementation for
> performance in the traditional way.
>
> However, there are also certain inputs to functions which are known to
> be problematic or more difficult to round correctly than others and
> you must compute them out to N-digits to get correct rounding when
> using a particular implementation. The quality of implementation is in
> essence how often we hit those hard cases and how quickly we can deal
> with them. These benchtests that I added are more along those lines.
> For example, they include exp for the last subnormal and the first
> normal floating point number. Together these represent a test case
> which is known by the customer to identify quality of implementation
> issues. They have indicated to us this case comes up frequently enough
> that they would like it to be tracked and Carlos suggested adding it to
> the benchtests.
>
> When we change the implementation of exp, I still feel like keeping
> our eye on cases like this as well as the hard to round cases are
> interesting. We want to make sure things like that don't blow up. That
> was my intent with this.

Agreed. That's a better explanation than mine :-)
On Friday 15 December 2017 10:36 PM, Carlos O'Donell wrote:
> That is good, and that will show up when it gets committed as a change
> in the performance for these named sets.

Sure, no harm in adding the numbers; I just pointed it out since I
reckoned that Ben hadn't noticed that patch.

>>> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
>>> index 78f8ac73d5..cea50edea7 100644
>>> --- a/benchtests/pow-inputs
>>> +++ b/benchtests/pow-inputs
>>> @@ -509,3 +509,15 @@
>>>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>>>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>>>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
>>> +
>>> +
>>> +# Contributed based on customer reports
>>> +# Ben Woodard woodard@redhat.com
>>> +## name: redhat-slow-customer-cases
>>> +0x1.fffffffffffffp-1, 0x1.8p+0
>>> +0x1.ffffffffffffdp-1, 0x1.8p+0
>>> +0x1.ffffffffffff7p-1, 0x1.8p+0
>>> +## name: redhat-fast-customer-cases
>>> +0x1.ffffffffffffep-1, 0x1.8p+0
>>> +0x1.ffffffffffffcp-1, 0x1.8p+0
>>> +0x1.999999999999ap-4, 0x1.8p+0
>>>
>>
>> Why do these inputs need to be in a different namespace? They should go
>> into one of the existing namespaces, i.e. 768bits, 240bits, etc. Their
>> performance relative to the existing namespaces should give you an
>> indication where they fit in. If you want these inputs to look distinct
>> then you could add a comment indicating that these inputs came in from
>> RH; comments are just a single #.
>
> This is a question of outcome and expectations.
>
> We have customers for whom certain inputs need to be performant relative
> to others. If we mix those numbers into the existing namespace then it's
> hard to see if they got better or worse on their own, since they are lost
> in the mean.
>
> We identify them as things that Red Hat as a downstream wants to keep
> working well and we will put resources behind that.
>
> Does that explain why we don't want to mix these with the other
> namespaces?

Sure, but that's more a Red Hat-specific use case than an upstream use
case. From the upstream perspective we only care about the input, and it
doesn't get treated any more specially than other inputs (my response to
Ben explains why). Maybe one way to deal with this is to upstream this
with just a comment at the end of the upstream namespace it belongs to,
and then replace the comment with the ## name tag in a downstream patch
for Red Hat/Fedora.

Siddhesh
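[For context, a sketch of what Siddhesh's suggestion would look like in
the input file; the exact directive syntax is an assumption based on the
snippets quoted in this thread, where `##` introduces a directive such
as a partition name and a single `#` is a plain comment:]

```text
# upstream: inputs folded into the surrounding partition, with a comment
# Contributed based on customer reports
# Contact: Ben Woodard <woodard@redhat.com>
0x1.fffffffffffffp-1, 0x1.8p+0

# downstream (Red Hat/Fedora) patch: promote the comment to a named partition
## name: redhat-slow-customer-cases
0x1.fffffffffffffp-1, 0x1.8p+0
```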
On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
> On Thu, 14 Dec 2017, Ben Woodard wrote:
>
>> Here are the results of that part of the modified tests:
>
> What do you get with the exp patch (version 8) that has been approved and
> now just needs to be committed?

I assume you'll be committing that as reviewer; I don't think Paul has
commit access.

Siddhesh
On Friday 15 December 2017 11:49 PM, Ben Woodard wrote:
> Right now, they demonstrate specific numbers that trigger the
> implementation to go into different modes. With our current
> implementation, this is the fast path, the slow path and the very slow
> path. So these functions are there to show how well the current
> implementation is running when it enters those modes. You can easily
> see if a code change or a compiler change improves or impacts
> performance. This is good for optimizing the implementation for
> performance in the traditional way.
>
> However, there are also certain inputs to functions which are known to
> be problematic or more difficult to round correctly than others and
> you must compute them out to N-digits to get correct rounding when
> using a particular implementation. The quality of implementation is in
> essence how often we hit those hard cases and how quickly we can deal
> with them. These benchtests that I added are more along those lines.
> For example, they include exp for the last subnormal and the first
> normal floating point number. Together these represent a test case
> which is known by the customer to identify quality of implementation
> issues. They have indicated to us this case comes up frequently enough
> that they would like it to be tracked and Carlos suggested adding it to
> the benchtests.

You're right in that specific inputs (or rather results) are more
important than others due to the number of bits required to ensure a
correct rounding regardless of the algorithm used. However, they assume
importance from a practical standpoint only if they result in vastly
different behaviour in the algorithm. Slow path vs fast path, for
example, makes sense if the algorithm actually has a slow and a fast
path. In that context, the last subnormal and first normal number are
important, but not more important than other inputs that exercise the
exact same code path.

This is why I suggest merging these inputs into the code path partition
that they exercise, since that sort of partitioning scheme makes more
sense for upstream code than 'redhat-fast-customer-cases' and
'redhat-slow-customer-cases'. If you want to see those input results
separately, then a downstream patch that converts the comment into a
"## name: " tag is a more appropriate solution.

Note that this is different from the concept of workloads that we add to
benchmark inputs, where the sequence of calls is also important and
hence they're executed slightly differently. See for example the CPU2006
inputs that we added to simulate that workload. If your inputs emulate a
specific and interesting workload then that's interesting as a separate
partition even if it is vendor-specific, but it doesn't seem like that
is the case here.

Siddhesh
On Saturday 16 December 2017 10:55 AM, Siddhesh Poyarekar wrote:
> Sure, but that's more a Red Hat specific use case than an upstream use
> case. From the upstream perspective we only care about the input and it
> doesn't get treated any more specially than other inputs (my response to
> Ben explains why). Maybe one way to deal with this is to upstream this
> with just a comment at the end of the upstream namespace it belongs and
> then replace the comment with the ##name tag in a downstream patch for
> Red Hat/Fedora.

To be clear, I am all for adding the inputs; I am just not convinced
about the separate vendor-specific partition for them. Vendor-specific
partitions only make sense if they are for a specific workload, such as
the spec2006.wrf inputs that Wilco added a while back for one of the
math functions.

Siddhesh
On 12/15/2017 11:31 PM, Siddhesh Poyarekar wrote:
> On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
>> On Thu, 14 Dec 2017, Ben Woodard wrote:
>>
>>> Here are the results of that part of the modified tests:
>>
>> What do you get with the exp patch (version 8) that has been approved and
>> now just needs to be committed?
>
> I assume you'll be committing that as reviewer; I don't think Paul has
> commit access.
>
> Siddhesh

For clarity, the exp() submitter is Patrick McGehearty [not Paul
McGehearty :-) ] and I do not (yet) have commit access.

- patrick
On Sunday 17 December 2017 01:40 AM, Patrick McGehearty wrote:
> For clarity, the exp() submitter is Patrick McGehearty [not Paul
> McGehearty :-) ] and I do not (yet) have commit access.

Ugh, I'm so sorry. I guess I kept thinking Paul because it sounded so
much like Paul McCartney to me :)

Siddhesh
On Sat, 16 Dec 2017, Siddhesh Poyarekar wrote:

> You're right in that specific inputs (or rather results) are more
> important than others due to the number of bits required to ensure a
> correct rounding regardless of the algorithm used. However, they assume

And such a correct rounding is not part of the accuracy goals for
functions such as exp (only for a few functions such as sqrt and fma
that are directly bound to IEEE 754 operations). This does not rule out
support for TS 18661-4 reserved function names such as crexp, though
that was not part of my TS 18661-4 proposal. If based on exhaustive
searches for worst cases for correct rounding, such implementations
would not actually need anywhere near the number of bits used by the
old exp implementation in glibc.
On Sat, 16 Dec 2017, Patrick McGehearty wrote:

> On 12/15/2017 11:31 PM, Siddhesh Poyarekar wrote:
>> On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
>>> On Thu, 14 Dec 2017, Ben Woodard wrote:
>>>
>>>> Here are the results of that part of the modified tests:
>>>
>>> What do you get with the exp patch (version 8) that has been approved and
>>> now just needs to be committed?
>>
>> I assume you'll be committing that as reviewer; I don't think Paul has
>> commit access.
>>
>> Siddhesh
>
> For clarity, the exp() submitter is Patrick McGehearty [not Paul
> McGehearty :-) ] and I do not (yet) have commit access.

If you'd like the patch committed for you, please provide a GNU
ChangeLog entry for it.
On 12/18/2017 11:19 AM, Joseph Myers wrote:
> On Sat, 16 Dec 2017, Patrick McGehearty wrote:
>
>> For clarity, the exp() submitter is Patrick McGehearty [not Paul
>> McGehearty :-) ] and I do not (yet) have commit access.
>
> If you'd like the patch committed for you, please provide a GNU ChangeLog
> entry for it.

Thank you for the offer. Proposed ChangeLog entry:

2017-12-19  Patrick McGehearty  <patrick.mcgehearty@oracle.com>

	* sysdeps/ieee754/dbl-64/e_exp.c: 5x faster __ieee754_exp() routine.
	* sysdeps/ieee754/dbl-64/eexp.tbl: New file for e_exp.
	* sysdeps/ieee754/dbl-64/slowexp.c: Removed.
	* sysdeps/x86_64/fpu/multiarch/slowexp-avx.c: Removed.
	* sysdeps/x86_64/fpu/multiarch/slowexp-fma.c: Removed.
	* sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c: Removed.
	* sysdeps/generic/math_private.h: Remove slowexp.
	* sysdeps/ieee754/dbl-64/e_pow.c: Likewise.
	* sysdeps/powerpc/power4/fpu/Makefile: Likewise.
	* sysdeps/x86_64/fpu/multiarch/Makefile: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp-avx.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp-fma.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c: Likewise.
	* math/Makefile: Likewise.
	* manual/probes.texi: Remove slowexp documentation.
I have now committed the exp patch. Note for future patches the changes
made to the ChangeLog entry. I added removals of
sysdeps/i386/fpu/slowexp.c, sysdeps/ia64/fpu/slowexp.c and
sysdeps/m68k/m680x0/fpu/slowexp.c, three files overriding the build of
slowexp.c for architectures with their own exp implementations, which
are no longer needed now that nothing builds slowexp.c.
On 19/12/2017 15:29, Joseph Myers wrote:
> I have now committed the exp patch. Note for future patches the changes
> made to the ChangeLog entry. I added removals of
> sysdeps/i386/fpu/slowexp.c, sysdeps/ia64/fpu/slowexp.c and
> sysdeps/m68k/m680x0/fpu/slowexp.c, three files overriding the build of
> slowexp.c for architectures with their own exp implementations, which are
> no longer needed now nothing builds slowexp.c.

I am now seeing this on master with GCC 7.1.1:

$ cat math/test-double-tgamma.out
testing double (without inline functions)
Failure: tgamma_upward (-0xb.4ffffp+4): errno set to 0, expected 34 (ERANGE)
Failure: tgamma_upward (-0xb.60001p+4): errno set to 0, expected 34 (ERANGE)
Failure: tgamma_upward (-0xb.6ffffp+4): errno set to 0, expected 34 (ERANGE)

Test suite completed:
  1412 test cases plus 1408 tests for exception flags and
    1408 tests for errno executed.
  3 errors occurred.

Is this a known issue?
On Tue, 19 Dec 2017, Adhemerval Zanella wrote:

> I am seeing now on master with GCC 7.1.1:
>
> $ cat math/test-double-tgamma.out
> testing double (without inline functions)
> Failure: tgamma_upward (-0xb.4ffffp+4): errno set to 0, expected 34 (ERANGE)
> Failure: tgamma_upward (-0xb.60001p+4): errno set to 0, expected 34 (ERANGE)
> Failure: tgamma_upward (-0xb.6ffffp+4): errno set to 0, expected 34 (ERANGE)
>
> Test suite completed:
>   1412 test cases plus 1408 tests for exception flags and
>     1408 tests for errno executed.
>   3 errors occurred.
>
> Is it a known issue?

Thanks for pointing this out. I've reverted the exp patch and ulps
update. As and when whatever caused those failures is fixed, it can go
back in.
Here is a possible cause for the failures, not verified as such:

The exp implementation was using get_rounding_mode. But tgamma uses
SET_RESTORE_ROUND and calls exp within the scope of SET_RESTORE_ROUND.
SET_RESTORE_ROUND, for x86_64, only sets the SSE rounding mode, but
get_rounding_mode gets the x87 rounding mode. The effect would have been
that the code in exp found that the x87 rounding mode was not
to-nearest, then redundantly set the SSE rounding mode to to-nearest,
then ended up trying to restore the rounding mode and actually setting
the SSE rounding mode to the not-to-nearest value of the x87 rounding
mode.

If this hypothesis is correct, I advise resubmitting the patch in a form
that just uses SET_RESTORE_ROUND (FE_TONEAREST) like other libm
functions - the performance improvement was sufficient that presumably
even this form would still perform better than the existing code. Then,
once that is properly validated and checked in, it would be possible to
restore the optimization that the use of get_rounding_mode and separate
code paths were intended to achieve. To do that, you'd need
libc_fegetround{,f,l}, which by default just use get_rounding_mode, but
on x86 __SSE2_MATH__ do something different for the float and double
versions to get the SSE rounding mode instead; exp would use
libc_fegetround instead of using get_rounding_mode directly.
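[The save/compute/restore discipline that SET_RESTORE_ROUND provides can
be sketched with a standard-library analogy - Python's decimal contexts,
not glibc's fenv machinery. The property Joseph describes matters
because the callee must restore exactly the state it saved; saving the
mode from one place (x87) while restoring it to another (SSE) corrupts
the caller's mode.]

```python
import decimal

def exp_to_nearest(x):
    # Analogue of SET_RESTORE_ROUND (FE_TONEAREST): save the caller's
    # context, force round-to-nearest for the computation, and restore
    # the saved context on exit from the with-block.
    with decimal.localcontext() as ctx:
        ctx.rounding = decimal.ROUND_HALF_EVEN
        return decimal.Decimal(x).exp()

decimal.getcontext().rounding = decimal.ROUND_CEILING   # caller's mode
r = exp_to_nearest(1)
# The caller's rounding mode survives the call untouched.
assert decimal.getcontext().rounding == decimal.ROUND_CEILING
```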
On 12/19/2017 12:54 PM, Joseph Myers wrote:
> Here is a possible cause for the failures, not verified as such:
>
> The exp implementation was using get_rounding_mode. But tgamma uses
> SET_RESTORE_ROUND and calls exp within the scope of SET_RESTORE_ROUND.
> SET_RESTORE_ROUND, for x86_64, only sets the SSE rounding mode, but
> get_rounding_mode gets the x87 rounding mode. The effect would have been
> that the code in exp found that the x87 rounding mode was not to-nearest,
> then redundantly set the SSE rounding mode to to-nearest, then ended up
> trying to restore the rounding mode and actually setting the SSE rounding
> mode to the not-to-nearest value of the x87 rounding modes.
>
> If this hypothesis is correct, I advise resubmitting the patch in a form
> that just uses SET_RESTORE_ROUND (FE_TONEAREST) like other libm functions
> - the performance improvement was sufficient that presumably even this
> form would still perform better than the existing code.

I did note in the comments (fairly far down) to the patch that "For x86,
tgamma showed a few values where the ulp increased to 6 (max ulp for
tgamma is 5). Sparc tgamma did not show these failures."

As noted, tgamma as currently computed is seriously non-linear in its
behavior, with 1-bit differences in exp() likely to result in many-bit
differences in computed values. Nelson Beebe's recent (Aug 2017) book
"The Mathematical Function Computation Handbook" (over 1100 pages) gives
an extended discussion of how to reduce the ulp differences. But that's
beyond the scope of the current project.

I'll investigate Joseph's hypothesis. If the hypothesis is correct, I
find it odd for get_rounding_mode not to have matching behavior with
libc_fesetround for x86.

- patrick
On Wed, 20 Dec 2017, Patrick McGehearty wrote:

> I did note in the comments (fairly far down) to the patch that
> "For x86, tgamma showed a few values where the ulp increased to
> 6 (max ulp for tgamma is 5). Sparc tgamma did not show these failures."

The problem isn't ulps changes; the problem is that there were test
failures (missing errno setting in this case) *other than* ones for
which a libm-test-ulps regeneration sufficed, and such failures always
need more investigation.

> If the hypothesis is correct, I find it odd for get_rounding_mode
> to not have matching behavior with libc_fesetround for x86.

Well, get_rounding_mode has previously only been used (on x86_64 / x86)
in interfaces such as strtod / printf (for which it was originally
added), for which results when the SSE and x87 rounding modes are
different don't matter. Whereas libm functions can be called internally
when those modes are different, as part of optimizations that rely on
knowing which rounding mode is relevant for computations in a particular
type and avoiding changing the other part of the floating-point state.
Thus, you need type-specific interfaces in libm (libc_fegetround{,f,l})
whereas you don't in strtod / printf. (Of course this doesn't make any
difference for architectures other than x86_64 / x86, because that's the
only case where you have two different sets of floating-point state at
all.)
On 12/20/2017 11:56 AM, Joseph Myers wrote:
> On Wed, 20 Dec 2017, Patrick McGehearty wrote:
>
>> I did note in the comments (fairly far down) to the patch that
>> "For x86, tgamma showed a few values where the ulp increased to
>> 6 (max ulp for tgamma is 5). Sparc tgamma did not show these failures."
>
> The problem isn't ulps changes, the problem is that there were test
> failures (missing errno setting in this case) *other than* ones for which
> a libm-test-ulps regeneration sufficed, and such failures always need more
> investigation.
>
>> If the hypothesis is correct, I find it odd for get_rounding_mode
>> to not have matching behavior with libc_fesetround for x86.
>
> Well, get_rounding_mode has previously only been used (on x86_64 / x86) in
> interfaces such as strtod / printf (for which it was originally added),
> for which results when the SSE and x87 rounding modes are different don't
> matter.  Whereas libm functions can be called internally when those modes
> are different, as part of optimizations that rely on knowing which
> rounding mode is relevant for computations in a particular type and
> avoiding changing the other part of the floating-point state.  Thus, you
> need type-specific interfaces in libm (libc_fegetround{,f,l}) whereas you
> don't in strtod / printf.  (Of course this doesn't make any difference for
> architectures other than x86_64 / x86, because that's the only case where
> you have two different sets of floating-point state at all.)

Summary of results:

Tested replacing get_rounding_mode and libc_fesetround (used only when
not in FE_TONEAREST mode) with SET_RESTORE_ROUND (FE_TONEAREST). The
tgamma failures disappeared, giving strong support to Joseph's
hypothesis.

For x86, there is roughly a 5% performance cost, which would be
tolerable. Unfortunately, for Sparc, there is a 76% performance cost,
which is less tolerable. When Sparc changes the rounding mode, the
instruction pipeline is flushed, which has a non-trivial cost.
I suspect the performance cost will be higher on all platforms which
have a high cost for changing the rounding mode.

Alternatives:

1) Accept the current behavior.
   - Least work.

2) Just use SET_RESTORE_ROUND for now.
   - Almost no additional work, but slower on some platforms.

3) Define a macro, either within e_exp.c or in an include file, that
   selects get_rounding_mode and libc_fesetround for all platforms
   except x86, and SET_RESTORE_ROUND for x86.  Putting platform-specific
   macros inside the ieee754 branch seems undesirable, but I thought I
   should mention it as a possibility.

4) Put the SET_RESTORE_ROUND version of e_exp.c as an x86-specific
   version in the sysdeps tree and the get_rounding_mode version in the
   ieee754 part of the tree.  Some more work when I'm ready to go on
   winter break. :-)

Or, do (2) now and maybe (4) later.

Also, as a separate experiment, I tested doubling the size of TBL[]
again to use step sizes of 1/128 instead of 1/64.  It reduces the error
rate from 16.1/10000 to 9.7/10000.  It does not remove the 1-bit errors
in the current "make check" tests.  To fix those will likely require
increasing the size of TBL2[].  For my next patch I'm going to go with
the larger TBL[] size unless someone objects.  Might as well reduce even
small errors when we can.

- patrick

ps Apologies for continuing to accidentally send incomplete messages
while composing.  I apparently have an unconscious trained reflex to
type ^S whenever I pause for thought.  That is beneficial in editors
that interpret it as "save file".  Not so much in mail programs that
interpret ^S as "send".  I will try to remember to compose outside of
mail in the future and then copy/paste into the mailer.
On Wed, 20 Dec 2017, Patrick McGehearty wrote:
> The tgamma failures disappeared, giving strong support to Joseph's hypothesis.
> For x86, there is roughly a 5% performance cost which would be tolerable.
> Unfortunately, for Sparc, there is a 76% performance cost which is
> less tolerable. When Sparc changes the rounding mode, the instruction pipeline

But is presumably still better than the existing code (as you mentioned
a 5x improvement), so is a reasonable incremental step.

> 3) Define a macro either within e_exp.c or in an include file that selects
> get_rounding_mode and libc_fesetround for all platforms except x86.
> It selects SET_RESTORE_ROUND for x86.
> Putting platform specific macros inside ieee754 branch seems
> undesirable, but I thought I should mention it as a possibility.

The correct thing to do is as I said: add libc_fegetround,
libc_fegetroundf and libc_fegetroundl to the large set of
math_private.h / fenv_private.h libc_fe* macros.  All of these would
default to using get_rounding_mode, but sysdeps/i386/fpu/fenv_private.h
would, in the __SSE_MATH__ case, use the SSE rounding mode for
libc_fegetroundf, and in the __SSE2_MATH__ case use it also for
libc_fegetround.  Then you could use libc_fegetround where you
previously used get_rounding_mode.

However, using SET_RESTORE_ROUND as an incremental step still makes
sense before adding libc_fegetround* as an improvement on top of that.
On 12/20/2017 5:42 PM, Joseph Myers wrote:
> On Wed, 20 Dec 2017, Patrick McGehearty wrote:
>
>> The tgamma failures disappeared, giving strong support to Joseph's hypothesis.
>> For x86, there is roughly a 5% performance cost which would be tolerable.
>> Unfortunately, for Sparc, there is a 76% performance cost which is
>> less tolerable. When Sparc changes the rounding mode, the instruction pipeline
>
> But is presumably still better than the existing code (as you mentioned a
> 5x improvement), so is a reasonable incremental step.
>
>> 3) Define a macro either within e_exp.c or in an include file that selects
>> get_rounding_mode and libc_fesetround for all platforms except x86.
>> It selects SET_RESTORE_ROUND for x86.
>> Putting platform specific macros inside ieee754 branch seems
>> undesirable, but I thought I should mention it as a possibility.
>
> The correct thing to do is as I said: add libc_fegetround,
> libc_fegetroundf and libc_fegetroundl to the large set of math_private.h /
> fenv_private.h libc_fe* macros.  All of these would default to using
> get_rounding_mode, but sysdeps/i386/fpu/fenv_private.h would, in the
> __SSE_MATH__ case, use the SSE rounding mode for libc_fegetroundf, and in
> the __SSE2_MATH__ case use it also for libc_fegetround.  Then you could
> use libc_fegetround where you previously used get_rounding_mode.
>
> However, using SET_RESTORE_ROUND as an incremental step still makes sense
> before adding libc_fegetround* as an improvement on top of that.

While investigating SET_RESTORE_ROUND, I found two cases which were
missing the rounding mode correction.  Fixing those removed the cexp()
test case failures.  I also increased the table size to 256 from 64 to
reduce the 1 ulp failures to less than 1 in 1000 by informal testing.
Those remaining failures are likely to be in the range of 0.51 ulp.
Investigating those issues delayed my revised submission until now.
I agree it would be desirable to add libc_fegetround* where appropriate, but I also agree it is better to get the revised patch in the code base first, then see about the rounding changes. - Patrick McGehearty
diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
index aff3fb42f1..b94e19c4f5 100644
--- a/benchtests/exp-inputs
+++ b/benchtests/exp-inputs
@@ -587,3 +587,12 @@
 0x1.0000015853da7p0
 0x1.0000098e5e007p0
 0x1.0000099a1ac59p0
+
+# Contributed based on customer reports
+# Ben Woodard woodard@redhat.com
+## name: redhat-slow-customer-cases
+0x1.0p-53
+## name: redhat-fast-customer-cases
+0x1.0p-52
+0x1.0p+0
+0x1.999999999999Ap-4
diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
index 78f8ac73d5..cea50edea7 100644
--- a/benchtests/pow-inputs
+++ b/benchtests/pow-inputs
@@ -509,3 +509,15 @@
 0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
 0x1.fa5c677254961p133, -0x1.c91962524971ep-1
 0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
+
+
+# Contributed based on customer reports
+# Ben Woodard woodard@redhat.com
+## name: redhat-slow-customer-cases
+0x1.fffffffffffffp-1, 0x1.8p+0
+0x1.ffffffffffffdp-1, 0x1.8p+0
+0x1.ffffffffffff7p-1, 0x1.8p+0
+## name: redhat-fast-customer-cases
+0x1.ffffffffffffep-1, 0x1.8p+0
+0x1.ffffffffffffcp-1, 0x1.8p+0
+0x1.999999999999ap-4, 0x1.8p+0
A customer who has asked not to be named has had some problems with
performance for some inputs of certain libm functions.  We felt that a
good idea would be to capture and preserve these test cases so that we
could track and measure the performance for these cases over time.
These values evidently came from analyzing performance problems with
actual running codes.  The intent is that more test cases will be added
over time.

The fast and the slow cases are separated in the test results so that
we can not only track any improvements in the worst-case performance
but also detect any degradation in the performance of the optimal cases
should we change the algorithm or split the range differently.

Here are the results of that part of the modified tests:

  "exp": {
   "": {
    "duration": 3.10837e+10,
    "iterations": 5.654e+06,
    "max": 37073.6,
    "min": 61.904,
    "mean": 5497.65
   },
   "144bits": {
    "duration": 2.9752e+10,
    "iterations": 1.83e+06,
    "max": 36458.4,
    "min": 159.793,
    "mean": 16257.9
   },
   "768bits": {
    "duration": 2.98869e+10,
    "iterations": 105000,
    "max": 307814,
    "min": 277808,
    "mean": 284637
   },
   "redhat-slow-customer-cases": {
    "duration": 2.90641e+10,
    "iterations": 241000,
    "max": 173768,
    "min": 118393,
    "mean": 120598
   },
   "redhat-fast-customer-cases": {
    "duration": 2.89308e+10,
    "iterations": 1.94562e+08,
    "max": 285.431,
    "min": 143.328,
    "mean": 148.697
   }
  },

  "pow": {
   "": {
    "duration": 2.91114e+10,
    "iterations": 5.9899e+07,
    "max": 1340.38,
    "min": 149.229,
    "mean": 486.009
   },
   "240bits": {
    "duration": 3.76374e+10,
    "iterations": 400000,
    "max": 110951,
    "min": 76769.9,
    "mean": 94093.4
   },
   "768bits": {
    "duration": 1.54488e+11,
    "iterations": 101000,
    "max": 1.60541e+06,
    "min": 742756,
    "mean": 1.52959e+06
   },
   "redhat-slow-customer-cases": {
    "duration": 3.02998e+10,
    "iterations": 42000,
    "max": 790097,
    "min": 692942,
    "mean": 721425
   },
   "redhat-fast-customer-cases": {
    "duration": 2.89989e+10,
    "iterations": 4.8063e+07,
    "max": 1339.97,
    "min": 564.295,
    "mean": 603.352
   }
  },

Signed-off-by: Ben Woodard
<woodard@redhat.com>
---
 benchtests/exp-inputs |  9 +++++++++
 benchtests/pow-inputs | 12 ++++++++++++
 2 files changed, 21 insertions(+)