Message ID | 201312200112.rBK1CIxx029481@ignucius.se.axis.com |
---|---|
State | New |
Headers | show |
On 20 December 2013 01:12, Hans-Peter Nilsson wrote: > Here's a patch that splits up 20_util/hash/chi2_quality.cc *and* > increases some of the iteration numbers for simulator targets to > something that passes for all working targets mentioned below. > I am a bit worried about the stability of these tests and the > implementation, seeing this amount of target-specific > differences in results. Maybe a person interested in the > quality of the implementation and knowing statistics should have > a look. I think Matt Austern at Google contributed those hash quality tests originally. > Ok to commit? OK, thanks.
Many thanks for your effort in fixing the issue. I can confirm that the new tests pass on arm-eabi using qemu as the simulator. Thanks, Yufeng P.s. Wish you have nice holiday break and happy new year! On 12/20/13 01:12, Hans-Peter Nilsson wrote: > Here's a patch that splits up 20_util/hash/chi2_quality.cc *and* > increases some of the iteration numbers for simulator targets to > something that passes for all working targets mentioned below. > I am a bit worried about the stability of these tests and the > implementation, seeing this amount of target-specific > differences in results. Maybe a person interested in the > quality of the implementation and knowing statistics should have > a look. > > I originally naively thought just splitting up the test would > help; allowing to remove the simulator target-test-constraints > completely, but that only produced timeouts for cris-elf. > > The effect of this patch is therefore to split it up and > increase some of the SAMPLES values compared to the the current > commit, assuming there's a useful linear dependence. Don't be > fooled by test_document_words now being excluded at the > test-top: it already was (for all SAMPLES< 100000), except for > compiling and running an empty function; see the original. > > I've tested this on two different x86_64-linux-gnu hosts, Fedora > 17 and Debian 7 aka. "wheezy" (to eliminate my suspicion of > distro differences) and three simulator targets: cris-elf, > powerpc-eabi and mipsisa32r2el-elf. I also tried running these > tests for arm-eabi / arm-sim.exp but they all failed apparently > because of memory resource constraints within the simulator > setup: in libstdc++.log I see "terminate called after throwing > an instance of 'std::bad_alloc'". > > I felt I honestly had to increase some of the SAMPLES numbers > from the current number of 30000; the final number I came up > with, passes for all the mentioned working targets. I didn't > want to *decrease* any of the numbers (e.g. for simulator only) > to exploit some local minima of the "k" values (for example > there's one such for 10000 for all targets above except > x86_64/32 and the ones the ARM people mentioned). So, I > increased the respective number to be higher than where at least > one target showed a test-suite failure. When checking the > SAMPLES numbers for the hosts, I of course just hacked the > default SAMPLES temporarily; they were not tested as simulator > targets. The SAMPLES number for all but test_bit_flip_set change > to 35000; a number somewhat arbitrarily chosen as higher than > 30000. > > Curiously, I never saw test_bit_flip_set fail as mentioned in > <http://gcc.gnu.org/ml/libstdc++/2013-10/msg00253.html>. Maybe > that was a mistaken observation and what was observed was > test_bit_string_set failing. (That's a good reason to always > copy-paste instead of *typing* from memory or from reading.) > That test certainly failed for most targets, but also for > SAMPLES=30000, not mentioned in the referenced message, which > made me suspect a distribution- or glibc-version-related > difference. > > A higher SAMPLES number definitely has a noticeable cost in > terms of test-time, challenging the statement "it doesn't take > that much time either". For both cris-elf-sim and > powerpc-eabi-sim running on a x86_64 host of yesteryear, the > time for the affected tests went from about 30 seconds to 4 min > 28 seconds and 6 min 20 seconds respectively, going from 10000 > to these numbers. For the original r205810 change compared to > r205803, test-time for cris-elf for a *complete "make > check"-run* (C, C++, ObjC, Fortran including libraries) went > from 3h56min to 4h5min (when the test timed out). > > Ok to commit? > > libstdc++-v3: > * testsuite/20_util/hash/chi2_quality.h: Break out from > chi2_quality.cc. > * testsuite/20_util/hash/chi2_q_bit_flip_set.cc: Ditto. > * testsuite/20_util/hash/chi2_q_document_words.cc: Ditto. > * testsuite/20_util/hash/chi2_q_bit_string_set.cc: Ditto. Increase > SAMPLES to 35000 for simulator targets. > * testsuite/20_util/hash/chi2_q_numeric_pattern_set.cc: Ditto. > * testsuite/20_util/hash/chi2_q_uniform_random.cc: Ditto. > * testsuite/20_util/hash/chi2_quality.cc: Remove. > > --- libstdc++-v3/testsuite/20_util/hash/chi2_quality.cc Sun Dec 15 15:01:43 2013 > +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 > @@ -1,218 +0,0 @@ > -// { dg-options "-std=gnu++0x" } > - > -// Use smaller statistics when running on simulators, so it takes less time. > -// { dg-options "-std=gnu++0x -DSAMPLES=30000" { target simulator } } > - > -// Copyright (C) 2010-2013 Free Software Foundation, Inc. > -// > -// This file is part of the GNU ISO C++ Library. This library is free > -// software; you can redistribute it and/or modify it under the > -// terms of the GNU General Public License as published by the > -// Free Software Foundation; either version 3, or (at your option) > -// any later version. > -// > -// This library is distributed in the hope that it will be useful, > -// but WITHOUT ANY WARRANTY; without even the implied warranty of > -// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > -// GNU General Public License for more details. > -// > -// You should have received a copy of the GNU General Public License > -// along with this library; see the file COPYING3. If not see > -//<http://www.gnu.org/licenses/>. > - > -// This file uses the chi^2 test to measure the quality of a hash > -// function, by computing the uniformity with which it distributes a set > -// of N strings into k buckets (where k is significantly greater than N). > -// > -// Each bucket has B[i] strings in it. The expected value of each bucket > -// for a uniform distribution is z = N/k, so > -// chi^2 = Sum_i (B[i] - z)^2 / z. > -// > -// We check whether chi^2 is small enough to be consistent with the > -// hypothesis of a uniform distribution. If F(chi^2, k-1) is close to > -// 0 (where F is the cumulative probability distribution), we can > -// reject that hypothesis. So we don't want F to be too small, which > -// for large k, means we want chi^2 to be not too much larger than k. > -// > -// We use the chi^2 test for several sets of strings. Any non-horrible > -// hash function should do well with purely random strings. A really > -// good hash function will also do well with more structured sets, > -// including ones where the strings differ by only a few bits. > - > -#include<algorithm> > -#include<cstdlib> > -#include<cstdio> > -#include<fstream> > -#include<functional> > -#include<iostream> > -#include<iterator> > -#include<string> > -#include<unordered_set> > -#include<vector> > -#include<testsuite_hooks.h> > - > -#ifndef SAMPLES > -#define SAMPLES 300000 > -#endif > - > -template<typename Container> > - double > - chi2_hash(const Container& c, long buckets) > - { > - std::vector<int> counts(buckets); > - std::hash<std::string> hasher; > - double elements = 0; > - for (auto i = c.begin(); i != c.end(); ++i) > - { > - ++counts[hasher(*i) % buckets]; > - ++elements; > - } > - > - const double z = elements / buckets; > - double sum = 0; > - for (long i = 0; i< buckets; ++i) > - { > - double delta = counts[i] - z; > - sum += delta*delta; > - } > - return sum/z; > - } > - > -// Tests chi^2 for a distribution of uniformly generated random strings. > -void > -test_uniform_random() > -{ > - bool test __attribute__((unused)) = true; > - std::srand(137); > - std::unordered_set<std::string> set; > - std::string s; > - const unsigned long N = SAMPLES; > - const unsigned long k = N/100; > - const unsigned int len = 25; > - while (set.size()< N) > - { > - s.clear(); > - for (unsigned int i = 0; i< len; ++i) > - s.push_back(rand() % 128); > - set.insert(s); > - } > - > - double chi2 = chi2_hash(set, k); > - VERIFY( chi2< k*1.1 ); > -} > - > -// Tests chi^2 for a distribution of strings that differ from each > -// other by only a few bits. We start with an arbitrary base string, and > -// flip three random bits for each member of the set. > -void > -test_bit_flip_set() > -{ > - bool test __attribute__((unused)) = true; > - const unsigned long N = SAMPLES; > - const unsigned long k = N/100; > - const unsigned int len = 67; > - const unsigned int bitlen = len * 8; > - const unsigned int bits_to_flip = 3; > - const char base[len+1] = "abcdefghijklmnopqrstuvwxyz" > - "ABCDEFGHIJKLMNOPQRSTUVWXYZ" > - "0123456789!@#$%"; > - > - std::unordered_set<std::string> set; > - while (set.size()< N) > - { > - std::string s(base, base+len); > - for (unsigned int i = 0; i< bits_to_flip; ++i) > - { > - int bit = rand() % bitlen; > - s[bit/8] ^= (1<< (bit%8)); > - } > - set.insert(s); > - } > - > - double chi2 = chi2_hash(set, k); > - VERIFY( chi2< k*1.1 ); > -} > - > -// Tests chi^2 of a set of strings that all have a similar pattern, > -// intended to mimic some sort of ID string. > -void > -test_numeric_pattern_set() > -{ > - bool test __attribute__((unused)) = true; > - const unsigned long N = SAMPLES; > - const unsigned long k = N/100; > - std::vector<std::string> set; > - for (unsigned long i = 0; i< N; ++i) > - { > - long i1 = i % 100000; > - long i2 = i / 100000; > - char buf[16]; > - std::sprintf(buf, "XX-%05lu-%05lu", i1, i2); > - set.push_back(buf); > - } > - > - double chi2 = chi2_hash(set, k); > - VERIFY( chi2< k*1.1 ); > -} > - > -// Tests chi^2 for a set of strings that all consist of '1' and '0'. > -void > -test_bit_string_set() > -{ > - bool test __attribute__((unused)) = true; > - const unsigned long N = SAMPLES; > - const unsigned long k = N/100; > - std::vector<std::string> set; > - std::string s; > - for (unsigned long i = 0; i< N; ++i) > - { > - s.clear(); > - for (unsigned int j = 0; j< sizeof(unsigned long) * 8; ++j) > - { > - const bool bit = (1UL<< j)& i; > - s.push_back(bit ? '1' : '0'); > - } > - set.push_back(s); > - } > - > - double chi2 = chi2_hash(set, k); > - VERIFY( chi2< k*1.1 ); > -} > - > -// Tests chi^2 for a set of words taken from a document written in English. > -void > -test_document_words() > -{ > - // That file is 187587 single-word lines. To avoid a timeout, just skip > - // this part, which would take up to 95% of the program runtime (with > - // SAMPLES == 10000), if we're not supposed to run anywhere that long. > -#if SAMPLES>= 100000 > - bool test __attribute__((unused)) = true; > - const std::string f_name = "thirty_years_among_the_dead_preproc.txt"; > - std::ifstream in(f_name); > - VERIFY( in.is_open() ); > - std::vector<std::string> words; > - words.assign(std::istream_iterator<std::string>(in), > - std::istream_iterator<std::string>()); > - VERIFY( words.size()> 100000 ); > - std::sort(words.begin(), words.end()); > - auto it = std::unique(words.begin(), words.end()); > - words.erase(it, words.end()); > - VERIFY( words.size()> 5000 ); > - > - const unsigned long k = words.size() / 20; > - double chi2 = chi2_hash(words, k); > - VERIFY( chi2< k*1.1 ); > -#endif > -} > - > -int > -main() > -{ > - test_uniform_random(); > - test_bit_flip_set(); > - test_numeric_pattern_set(); > - test_bit_string_set(); > - test_document_words(); > - return 0; > -} > --- /dev/null Tue Oct 29 15:57:07 2002 > +++ libstdc++-v3/testsuite/20_util/hash/chi2_quality.h Tue Dec 17 08:14:06 2013 > @@ -0,0 +1,74 @@ > +// Copyright (C) 2010-2013 Free Software Foundation, Inc. > +// > +// This file is part of the GNU ISO C++ Library. This library is free > +// software; you can redistribute it and/or modify it under the > +// terms of the GNU General Public License as published by the > +// Free Software Foundation; either version 3, or (at your option) > +// any later version. > +// > +// This library is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with this library; see the file COPYING3. If not see > +//<http://www.gnu.org/licenses/>. > + > +// This file uses the chi^2 test to measure the quality of a hash > +// function, by computing the uniformity with which it distributes a set > +// of N strings into k buckets (where k is significantly greater than N). > +// > +// Each bucket has B[i] strings in it. The expected value of each bucket > +// for a uniform distribution is z = N/k, so > +// chi^2 = Sum_i (B[i] - z)^2 / z. > +// > +// We check whether chi^2 is small enough to be consistent with the > +// hypothesis of a uniform distribution. If F(chi^2, k-1) is close to > +// 0 (where F is the cumulative probability distribution), we can > +// reject that hypothesis. So we don't want F to be too small, which > +// for large k, means we want chi^2 to be not too much larger than k. > +// > +// We use the chi^2 test for several sets of strings. Any non-horrible > +// hash function should do well with purely random strings. A really > +// good hash function will also do well with more structured sets, > +// including ones where the strings differ by only a few bits. > + > +#include<algorithm> > +#include<cstdlib> > +#include<cstdio> > +#include<fstream> > +#include<functional> > +#include<iostream> > +#include<iterator> > +#include<string> > +#include<unordered_set> > +#include<vector> > +#include<testsuite_hooks.h> > + > +#ifndef SAMPLES > +#define SAMPLES 300000 > +#endif > + > +template<typename Container> > + double > + chi2_hash(const Container& c, long buckets) > + { > + std::vector<int> counts(buckets); > + std::hash<std::string> hasher; > + double elements = 0; > + for (auto i = c.begin(); i != c.end(); ++i) > + { > + ++counts[hasher(*i) % buckets]; > + ++elements; > + } > + > + const double z = elements / buckets; > + double sum = 0; > + for (long i = 0; i< buckets; ++i) > + { > + double delta = counts[i] - z; > + sum += delta*delta; > + } > + return sum/z; > + } > --- /dev/null Tue Oct 29 15:57:07 2002 > +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_bit_flip_set.cc Tue Dec 17 08:18:53 2013 > @@ -0,0 +1,61 @@ > +// { dg-options "-std=gnu++0x" } > +// Use smaller statistics when running on simulators, so it takes less time. > +// { dg-options "-std=gnu++0x -DSAMPLES=30000" { target simulator } } > + > +// Copyright (C) 2010-2013 Free Software Foundation, Inc. > +// > +// This file is part of the GNU ISO C++ Library. This library is free > +// software; you can redistribute it and/or modify it under the > +// terms of the GNU General Public License as published by the > +// Free Software Foundation; either version 3, or (at your option) > +// any later version. > +// > +// This library is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with this library; see the file COPYING3. If not see > +//<http://www.gnu.org/licenses/>. > + > +#include "chi2_quality.h" > + > +// Tests chi^2 for a distribution of strings that differ from each > +// other by only a few bits. We start with an arbitrary base string, and > +// flip three random bits for each member of the set. > +void > +test_bit_flip_set() > +{ > + bool test __attribute__((unused)) = true; > + const unsigned long N = SAMPLES; > + const unsigned long k = N/100; > + const unsigned int len = 67; > + const unsigned int bitlen = len * 8; > + const unsigned int bits_to_flip = 3; > + const char base[len+1] = "abcdefghijklmnopqrstuvwxyz" > + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" > + "0123456789!@#$%"; > + > + std::unordered_set<std::string> set; > + while (set.size()< N) > + { > + std::string s(base, base+len); > + for (unsigned int i = 0; i< bits_to_flip; ++i) > + { > + int bit = rand() % bitlen; > + s[bit/8] ^= (1<< (bit%8)); > + } > + set.insert(s); > + } > + > + double chi2 = chi2_hash(set, k); > + VERIFY( chi2< k*1.1 ); > +} > + > +int > +main() > +{ > + test_bit_flip_set(); > + return 0; > +} > --- /dev/null Tue Oct 29 15:57:07 2002 > +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_bit_string_set.cc Tue Dec 17 08:23:42 2013 > @@ -0,0 +1,55 @@ > +// { dg-options "-std=gnu++0x" } > +// Use smaller statistics when running on simulators, so it takes less time. > +// For e.g. cris-elf, mipsisa32r2el-elf, powerpc-eabi and i386-linux-gnu, > +// this test fails for SAMPLES=30000. > +// { dg-options "-std=gnu++0x -DSAMPLES=35000" { target simulator } } > + > +// Copyright (C) 2010-2013 Free Software Foundation, Inc. > +// > +// This file is part of the GNU ISO C++ Library. This library is free > +// software; you can redistribute it and/or modify it under the > +// terms of the GNU General Public License as published by the > +// Free Software Foundation; either version 3, or (at your option) > +// any later version. > +// > +// This library is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with this library; see the file COPYING3. If not see > +//<http://www.gnu.org/licenses/>. > + > +#include "chi2_quality.h" > + > +// Tests chi^2 for a set of strings that all consist of '1' and '0'. > +void > +test_bit_string_set() > +{ > + bool test __attribute__((unused)) = true; > + const unsigned long N = SAMPLES; > + const unsigned long k = N/100; > + std::vector<std::string> set; > + std::string s; > + for (unsigned long i = 0; i< N; ++i) > + { > + s.clear(); > + for (unsigned int j = 0; j< sizeof(unsigned long) * 8; ++j) > + { > + const bool bit = (1UL<< j)& i; > + s.push_back(bit ? '1' : '0'); > + } > + set.push_back(s); > + } > + > + double chi2 = chi2_hash(set, k); > + VERIFY( chi2< k*1.1 ); > +} > + > +int > +main() > +{ > + test_bit_string_set(); > + return 0; > +} > --- /dev/null Tue Oct 29 15:57:07 2002 > +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_document_words.cc Tue Dec 17 08:38:03 2013 > @@ -0,0 +1,52 @@ > +// On some simulators, the workload is simply too large with values big > +// enough for the test to pass the quality test, so just skip it altogether. > +// { dg-do run { target { ! simulator } } } > +// { dg-options "-std=gnu++0x" } > + > +// Copyright (C) 2010-2013 Free Software Foundation, Inc. > +// > +// This file is part of the GNU ISO C++ Library. This library is free > +// software; you can redistribute it and/or modify it under the > +// terms of the GNU General Public License as published by the > +// Free Software Foundation; either version 3, or (at your option) > +// any later version. > +// > +// This library is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with this library; see the file COPYING3. If not see > +//<http://www.gnu.org/licenses/>. > + > +#include "chi2_quality.h" > + > +// Tests chi^2 for a set of words taken from a document written in English. > +void > +test_document_words() > +{ > + bool test __attribute__((unused)) = true; > + const std::string f_name = "thirty_years_among_the_dead_preproc.txt"; > + std::ifstream in(f_name); > + VERIFY( in.is_open() ); > + std::vector<std::string> words; > + words.assign(std::istream_iterator<std::string>(in), > + std::istream_iterator<std::string>()); > + VERIFY( words.size()> 100000 ); > + std::sort(words.begin(), words.end()); > + auto it = std::unique(words.begin(), words.end()); > + words.erase(it, words.end()); > + VERIFY( words.size()> 5000 ); > + > + const unsigned long k = words.size() / 20; > + double chi2 = chi2_hash(words, k); > + VERIFY( chi2< k*1.1 ); > +} > + > +int > +main() > +{ > + test_document_words(); > + return 0; > +} > --- /dev/null Tue Oct 29 15:57:07 2002 > +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_numeric_pattern_set.cc Tue Dec 17 08:20:52 2013 > @@ -0,0 +1,52 @@ > +// { dg-options "-std=gnu++0x" } > +// Use smaller statistics when running on simulators, so it takes less time. > +// For x86_64-linux-gnu SAMPLES=30000 fails, so increase slightly. > +// { dg-options "-std=gnu++0x -DSAMPLES=35000" { target simulator } } > + > +// Copyright (C) 2010-2013 Free Software Foundation, Inc. > +// > +// This file is part of the GNU ISO C++ Library. This library is free > +// software; you can redistribute it and/or modify it under the > +// terms of the GNU General Public License as published by the > +// Free Software Foundation; either version 3, or (at your option) > +// any later version. > +// > +// This library is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with this library; see the file COPYING3. If not see > +//<http://www.gnu.org/licenses/>. > + > +#include "chi2_quality.h" > + > +// Tests chi^2 of a set of strings that all have a similar pattern, > +// intended to mimic some sort of ID string. > +void > +test_numeric_pattern_set() > +{ > + bool test __attribute__((unused)) = true; > + const unsigned long N = SAMPLES; > + const unsigned long k = N/100; > + std::vector<std::string> set; > + for (unsigned long i = 0; i< N; ++i) > + { > + long i1 = i % 100000; > + long i2 = i / 100000; > + char buf[16]; > + std::sprintf(buf, "XX-%05lu-%05lu", i1, i2); > + set.push_back(buf); > + } > + > + double chi2 = chi2_hash(set, k); > + VERIFY( chi2< k*1.1 ); > +} > + > +int > +main() > +{ > + test_numeric_pattern_set(); > + return 0; > +} > --- /dev/null Tue Oct 29 15:57:07 2002 > +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_uniform_random.cc Tue Dec 17 08:16:42 2013 > @@ -0,0 +1,53 @@ > +// { dg-options "-std=gnu++0x" } > +// Use smaller statistics when running on simulators, so it takes less time. > +// For powerpc-eabi, SAMPLES=30000 fails. > +// { dg-options "-std=gnu++0x -DSAMPLES=35000" { target simulator } } > + > +// Copyright (C) 2010-2013 Free Software Foundation, Inc. > +// > +// This file is part of the GNU ISO C++ Library. This library is free > +// software; you can redistribute it and/or modify it under the > +// terms of the GNU General Public License as published by the > +// Free Software Foundation; either version 3, or (at your option) > +// any later version. > +// > +// This library is distributed in the hope that it will be useful, > +// but WITHOUT ANY WARRANTY; without even the implied warranty of > +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +// GNU General Public License for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with this library; see the file COPYING3. If not see > +//<http://www.gnu.org/licenses/>. > + > +#include "chi2_quality.h" > + > +// Tests chi^2 for a distribution of uniformly generated random strings. > +void > +test_uniform_random() > +{ > + bool test __attribute__((unused)) = true; > + std::srand(137); > + std::unordered_set<std::string> set; > + std::string s; > + const unsigned long N = SAMPLES; > + const unsigned long k = N/100; > + const unsigned int len = 25; > + while (set.size()< N) > + { > + s.clear(); > + for (unsigned int i = 0; i< len; ++i) > + s.push_back(rand() % 128); > + set.insert(s); > + } > + > + double chi2 = chi2_hash(set, k); > + VERIFY( chi2< k*1.1 ); > +} > + > +int > +main() > +{ > + test_uniform_random(); > + return 0; > +} > > brgds, H-P > PS. Happy Holidays! >
--- libstdc++-v3/testsuite/20_util/hash/chi2_quality.cc Sun Dec 15 15:01:43 2013 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,218 +0,0 @@ -// { dg-options "-std=gnu++0x" } - -// Use smaller statistics when running on simulators, so it takes less time. -// { dg-options "-std=gnu++0x -DSAMPLES=30000" { target simulator } } - -// Copyright (C) 2010-2013 Free Software Foundation, Inc. -// -// This file is part of the GNU ISO C++ Library. This library is free -// software; you can redistribute it and/or modify it under the -// terms of the GNU General Public License as published by the -// Free Software Foundation; either version 3, or (at your option) -// any later version. -// -// This library is distributed in the hope that it will be useful, -// but WITHOUT ANY WARRANTY; without even the implied warranty of -// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -// GNU General Public License for more details. -// -// You should have received a copy of the GNU General Public License -// along with this library; see the file COPYING3. If not see -// <http://www.gnu.org/licenses/>. - -// This file uses the chi^2 test to measure the quality of a hash -// function, by computing the uniformity with which it distributes a set -// of N strings into k buckets (where k is significantly greater than N). -// -// Each bucket has B[i] strings in it. The expected value of each bucket -// for a uniform distribution is z = N/k, so -// chi^2 = Sum_i (B[i] - z)^2 / z. -// -// We check whether chi^2 is small enough to be consistent with the -// hypothesis of a uniform distribution. If F(chi^2, k-1) is close to -// 0 (where F is the cumulative probability distribution), we can -// reject that hypothesis. So we don't want F to be too small, which -// for large k, means we want chi^2 to be not too much larger than k. -// -// We use the chi^2 test for several sets of strings. Any non-horrible -// hash function should do well with purely random strings. A really -// good hash function will also do well with more structured sets, -// including ones where the strings differ by only a few bits. - -#include <algorithm> -#include <cstdlib> -#include <cstdio> -#include <fstream> -#include <functional> -#include <iostream> -#include <iterator> -#include <string> -#include <unordered_set> -#include <vector> -#include <testsuite_hooks.h> - -#ifndef SAMPLES -#define SAMPLES 300000 -#endif - -template <typename Container> - double - chi2_hash(const Container& c, long buckets) - { - std::vector<int> counts(buckets); - std::hash<std::string> hasher; - double elements = 0; - for (auto i = c.begin(); i != c.end(); ++i) - { - ++counts[hasher(*i) % buckets]; - ++elements; - } - - const double z = elements / buckets; - double sum = 0; - for (long i = 0; i < buckets; ++i) - { - double delta = counts[i] - z; - sum += delta*delta; - } - return sum/z; - } - -// Tests chi^2 for a distribution of uniformly generated random strings. -void -test_uniform_random() -{ - bool test __attribute__((unused)) = true; - std::srand(137); - std::unordered_set<std::string> set; - std::string s; - const unsigned long N = SAMPLES; - const unsigned long k = N/100; - const unsigned int len = 25; - while (set.size() < N) - { - s.clear(); - for (unsigned int i = 0; i < len; ++i) - s.push_back(rand() % 128); - set.insert(s); - } - - double chi2 = chi2_hash(set, k); - VERIFY( chi2 < k*1.1 ); -} - -// Tests chi^2 for a distribution of strings that differ from each -// other by only a few bits. We start with an arbitrary base string, and -// flip three random bits for each member of the set. -void -test_bit_flip_set() -{ - bool test __attribute__((unused)) = true; - const unsigned long N = SAMPLES; - const unsigned long k = N/100; - const unsigned int len = 67; - const unsigned int bitlen = len * 8; - const unsigned int bits_to_flip = 3; - const char base[len+1] = "abcdefghijklmnopqrstuvwxyz" - "ABCDEFGHIJKLMNOPQRSTUVWXYZ" - "0123456789!@#$%"; - - std::unordered_set<std::string> set; - while (set.size() < N) - { - std::string s(base, base+len); - for (unsigned int i = 0; i < bits_to_flip; ++i) - { - int bit = rand() % bitlen; - s[bit/8] ^= (1 << (bit%8)); - } - set.insert(s); - } - - double chi2 = chi2_hash(set, k); - VERIFY( chi2 < k*1.1 ); -} - -// Tests chi^2 of a set of strings that all have a similar pattern, -// intended to mimic some sort of ID string. -void -test_numeric_pattern_set() -{ - bool test __attribute__((unused)) = true; - const unsigned long N = SAMPLES; - const unsigned long k = N/100; - std::vector<std::string> set; - for (unsigned long i = 0; i < N; ++i) - { - long i1 = i % 100000; - long i2 = i / 100000; - char buf[16]; - std::sprintf(buf, "XX-%05lu-%05lu", i1, i2); - set.push_back(buf); - } - - double chi2 = chi2_hash(set, k); - VERIFY( chi2 < k*1.1 ); -} - -// Tests chi^2 for a set of strings that all consist of '1' and '0'. -void -test_bit_string_set() -{ - bool test __attribute__((unused)) = true; - const unsigned long N = SAMPLES; - const unsigned long k = N/100; - std::vector<std::string> set; - std::string s; - for (unsigned long i = 0; i < N; ++i) - { - s.clear(); - for (unsigned int j = 0; j < sizeof(unsigned long) * 8; ++j) - { - const bool bit = (1UL << j) & i; - s.push_back(bit ? '1' : '0'); - } - set.push_back(s); - } - - double chi2 = chi2_hash(set, k); - VERIFY( chi2 < k*1.1 ); -} - -// Tests chi^2 for a set of words taken from a document written in English. -void -test_document_words() -{ - // That file is 187587 single-word lines. To avoid a timeout, just skip - // this part, which would take up to 95% of the program runtime (with - // SAMPLES == 10000), if we're not supposed to run anywhere that long. -#if SAMPLES >= 100000 - bool test __attribute__((unused)) = true; - const std::string f_name = "thirty_years_among_the_dead_preproc.txt"; - std::ifstream in(f_name); - VERIFY( in.is_open() ); - std::vector<std::string> words; - words.assign(std::istream_iterator<std::string>(in), - std::istream_iterator<std::string>()); - VERIFY( words.size() > 100000 ); - std::sort(words.begin(), words.end()); - auto it = std::unique(words.begin(), words.end()); - words.erase(it, words.end()); - VERIFY( words.size() > 5000 ); - - const unsigned long k = words.size() / 20; - double chi2 = chi2_hash(words, k); - VERIFY( chi2 < k*1.1 ); -#endif -} - -int -main() -{ - test_uniform_random(); - test_bit_flip_set(); - test_numeric_pattern_set(); - test_bit_string_set(); - test_document_words(); - return 0; -} --- /dev/null Tue Oct 29 15:57:07 2002 +++ libstdc++-v3/testsuite/20_util/hash/chi2_quality.h Tue Dec 17 08:14:06 2013 @@ -0,0 +1,74 @@ +// Copyright (C) 2010-2013 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. +// +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with this library; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +// This file uses the chi^2 test to measure the quality of a hash +// function, by computing the uniformity with which it distributes a set +// of N strings into k buckets (where k is significantly greater than N). +// +// Each bucket has B[i] strings in it. The expected value of each bucket +// for a uniform distribution is z = N/k, so +// chi^2 = Sum_i (B[i] - z)^2 / z. +// +// We check whether chi^2 is small enough to be consistent with the +// hypothesis of a uniform distribution. If F(chi^2, k-1) is close to +// 0 (where F is the cumulative probability distribution), we can +// reject that hypothesis. So we don't want F to be too small, which +// for large k, means we want chi^2 to be not too much larger than k. +// +// We use the chi^2 test for several sets of strings. Any non-horrible +// hash function should do well with purely random strings. A really +// good hash function will also do well with more structured sets, +// including ones where the strings differ by only a few bits. + +#include <algorithm> +#include <cstdlib> +#include <cstdio> +#include <fstream> +#include <functional> +#include <iostream> +#include <iterator> +#include <string> +#include <unordered_set> +#include <vector> +#include <testsuite_hooks.h> + +#ifndef SAMPLES +#define SAMPLES 300000 +#endif + +template <typename Container> + double + chi2_hash(const Container& c, long buckets) + { + std::vector<int> counts(buckets); + std::hash<std::string> hasher; + double elements = 0; + for (auto i = c.begin(); i != c.end(); ++i) + { + ++counts[hasher(*i) % buckets]; + ++elements; + } + + const double z = elements / buckets; + double sum = 0; + for (long i = 0; i < buckets; ++i) + { + double delta = counts[i] - z; + sum += delta*delta; + } + return sum/z; + } --- /dev/null Tue Oct 29 15:57:07 2002 +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_bit_flip_set.cc Tue Dec 17 08:18:53 2013 @@ -0,0 +1,61 @@ +// { dg-options "-std=gnu++0x" } +// Use smaller statistics when running on simulators, so it takes less time. +// { dg-options "-std=gnu++0x -DSAMPLES=30000" { target simulator } } + +// Copyright (C) 2010-2013 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. +// +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with this library; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +#include "chi2_quality.h" + +// Tests chi^2 for a distribution of strings that differ from each +// other by only a few bits. We start with an arbitrary base string, and +// flip three random bits for each member of the set. +void +test_bit_flip_set() +{ + bool test __attribute__((unused)) = true; + const unsigned long N = SAMPLES; + const unsigned long k = N/100; + const unsigned int len = 67; + const unsigned int bitlen = len * 8; + const unsigned int bits_to_flip = 3; + const char base[len+1] = "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "0123456789!@#$%"; + + std::unordered_set<std::string> set; + while (set.size() < N) + { + std::string s(base, base+len); + for (unsigned int i = 0; i < bits_to_flip; ++i) + { + int bit = rand() % bitlen; + s[bit/8] ^= (1 << (bit%8)); + } + set.insert(s); + } + + double chi2 = chi2_hash(set, k); + VERIFY( chi2 < k*1.1 ); +} + +int +main() +{ + test_bit_flip_set(); + return 0; +} --- /dev/null Tue Oct 29 15:57:07 2002 +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_bit_string_set.cc Tue Dec 17 08:23:42 2013 @@ -0,0 +1,55 @@ +// { dg-options "-std=gnu++0x" } +// Use smaller statistics when running on simulators, so it takes less time. +// For e.g. cris-elf, mipsisa32r2el-elf, powerpc-eabi and i386-linux-gnu, +// this test fails for SAMPLES=30000. +// { dg-options "-std=gnu++0x -DSAMPLES=35000" { target simulator } } + +// Copyright (C) 2010-2013 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. +// +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with this library; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +#include "chi2_quality.h" + +// Tests chi^2 for a set of strings that all consist of '1' and '0'. +void +test_bit_string_set() +{ + bool test __attribute__((unused)) = true; + const unsigned long N = SAMPLES; + const unsigned long k = N/100; + std::vector<std::string> set; + std::string s; + for (unsigned long i = 0; i < N; ++i) + { + s.clear(); + for (unsigned int j = 0; j < sizeof(unsigned long) * 8; ++j) + { + const bool bit = (1UL << j) & i; + s.push_back(bit ? '1' : '0'); + } + set.push_back(s); + } + + double chi2 = chi2_hash(set, k); + VERIFY( chi2 < k*1.1 ); +} + +int +main() +{ + test_bit_string_set(); + return 0; +} --- /dev/null Tue Oct 29 15:57:07 2002 +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_document_words.cc Tue Dec 17 08:38:03 2013 @@ -0,0 +1,52 @@ +// On some simulators, the workload is simply too large with values big +// enough for the test to pass the quality test, so just skip it altogether. +// { dg-do run { target { ! simulator } } } +// { dg-options "-std=gnu++0x" } + +// Copyright (C) 2010-2013 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. +// +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with this library; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +#include "chi2_quality.h" + +// Tests chi^2 for a set of words taken from a document written in English. +void +test_document_words() +{ + bool test __attribute__((unused)) = true; + const std::string f_name = "thirty_years_among_the_dead_preproc.txt"; + std::ifstream in(f_name); + VERIFY( in.is_open() ); + std::vector<std::string> words; + words.assign(std::istream_iterator<std::string>(in), + std::istream_iterator<std::string>()); + VERIFY( words.size() > 100000 ); + std::sort(words.begin(), words.end()); + auto it = std::unique(words.begin(), words.end()); + words.erase(it, words.end()); + VERIFY( words.size() > 5000 ); + + const unsigned long k = words.size() / 20; + double chi2 = chi2_hash(words, k); + VERIFY( chi2 < k*1.1 ); +} + +int +main() +{ + test_document_words(); + return 0; +} --- /dev/null Tue Oct 29 15:57:07 2002 +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_numeric_pattern_set.cc Tue Dec 17 08:20:52 2013 @@ -0,0 +1,52 @@ +// { dg-options "-std=gnu++0x" } +// Use smaller statistics when running on simulators, so it takes less time. +// For x86_64-linux-gnu SAMPLES=30000 fails, so increase slightly. +// { dg-options "-std=gnu++0x -DSAMPLES=35000" { target simulator } } + +// Copyright (C) 2010-2013 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. +// +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with this library; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +#include "chi2_quality.h" + +// Tests chi^2 of a set of strings that all have a similar pattern, +// intended to mimic some sort of ID string. +void +test_numeric_pattern_set() +{ + bool test __attribute__((unused)) = true; + const unsigned long N = SAMPLES; + const unsigned long k = N/100; + std::vector<std::string> set; + for (unsigned long i = 0; i < N; ++i) + { + long i1 = i % 100000; + long i2 = i / 100000; + char buf[16]; + std::sprintf(buf, "XX-%05lu-%05lu", i1, i2); + set.push_back(buf); + } + + double chi2 = chi2_hash(set, k); + VERIFY( chi2 < k*1.1 ); +} + +int +main() +{ + test_numeric_pattern_set(); + return 0; +} --- /dev/null Tue Oct 29 15:57:07 2002 +++ libstdc++-v3/testsuite/20_util/hash/chi2_q_uniform_random.cc Tue Dec 17 08:16:42 2013 @@ -0,0 +1,53 @@ +// { dg-options "-std=gnu++0x" } +// Use smaller statistics when running on simulators, so it takes less time. +// For powerpc-eabi, SAMPLES=30000 fails. +// { dg-options "-std=gnu++0x -DSAMPLES=35000" { target simulator } } + +// Copyright (C) 2010-2013 Free Software Foundation, Inc. +// +// This file is part of the GNU ISO C++ Library. This library is free +// software; you can redistribute it and/or modify it under the +// terms of the GNU General Public License as published by the +// Free Software Foundation; either version 3, or (at your option) +// any later version. +// +// This library is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with this library; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +#include "chi2_quality.h" + +// Tests chi^2 for a distribution of uniformly generated random strings. +void +test_uniform_random() +{ + bool test __attribute__((unused)) = true; + std::srand(137); + std::unordered_set<std::string> set; + std::string s; + const unsigned long N = SAMPLES; + const unsigned long k = N/100; + const unsigned int len = 25; + while (set.size() < N) + { + s.clear(); + for (unsigned int i = 0; i < len; ++i) + s.push_back(rand() % 128); + set.insert(s); + } + + double chi2 = chi2_hash(set, k); + VERIFY( chi2 < k*1.1 ); +} + +int +main() +{ + test_uniform_random(); + return 0; +}