Message ID | 20140428124003.GW928@redhat.com |
---|---|
State | New |
Headers | show |
On Mon, Apr 28, 2014 at 8:40 AM, Jonathan Wakely <jwakely@redhat.com> wrote: > I've been looking through the regex code and have a few ideas for > simplifications or optimisations that I'd like to share. Thanks :) > This first patch is for _BracketMatcher. We only use std::bitset when > is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__ > instead. We also only user _UnsignedCharT when is_same<_CharT, char> > so that can just be simplified to unsigned char. Yes, since _UnsignedCharT is just used as indexes, we can always use a larger unsigned integer instead. Maybe "size_t" is a better choice? I'm not sure if we'll have a wchar_t cache (bitset<65536>) in the future ;) > The contents of _BracketMatcher::_M_char_set are not sorted and can > contain duplicates in the current code. Making that a sorted, unique > list in _BracketMatcher::_M_ready() allows a binary search instead of > linear search. This improves worst case performance for pathological > regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]") > but I'm not sure if it helps in the common case. Trust me, in common case, even it's a bit slower, it won't be observable. So it's just OK. > Finally, in the non-char case the _CacheT member is an unused empty > object, so having that as the first member requires 7 bytes of > padding. Re-ordering the members reduces the size of a non-char > _BracketMatcher by 8 bytes (but it's still a whopping 96 bytes). > > (For a char _BracketMatcher the bitset cache makes it 128 bytes, > this patch doesn't change that). This is OK too. Rant: I can't understand why the committee doesn't keep things simple by letting struct { } has 0 size. I used to implement my own tuple by specializing tuple<> as an empty struct. The thing is, one tuple shouldn't `has it` as a member, or it will get extra padding; at last I used inheritance.
On 28/04/14 10:15 -0400, Tim Shen wrote: >On Mon, Apr 28, 2014 at 8:40 AM, Jonathan Wakely <jwakely@redhat.com> wrote: >> I've been looking through the regex code and have a few ideas for >> simplifications or optimisations that I'd like to share. > >Thanks :) > >> This first patch is for _BracketMatcher. We only use std::bitset when >> is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__ >> instead. We also only user _UnsignedCharT when is_same<_CharT, char> >> so that can just be simplified to unsigned char. > >Yes, since _UnsignedCharT is just used as indexes, we can always use a >larger unsigned integer instead. Maybe "size_t" is a better choice? There is a well-defined mapping from every unsigned char in the range [0,255] to char and back, so conversions between char and unsigned char are fine. If we used a larger type then we would get the wrong result when char is signed, because (size_t)(char)-1 != (unsigned char)(char)-1 >I'm not sure if we'll have a wchar_t cache (bitset<65536>) in the future ;) sizeof(wchar_t) is 4 on unix-like systems, but a cache for char16_t wouldn't be totally crazy. If you prefer I will not remove the uses of _CharT and _UnsignedCharT and won't assume the cache is only used for char. There are still some simplifications we can make. >> The contents of _BracketMatcher::_M_char_set are not sorted and can >> contain duplicates in the current code. Making that a sorted, unique >> list in _BracketMatcher::_M_ready() allows a binary search instead of >> linear search. This improves worst case performance for pathological >> regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]") >> but I'm not sure if it helps in the common case. > >Trust me, in common case, even it's a bit slower, it won't be >observable. So it's just OK. > >> Finally, in the non-char case the _CacheT member is an unused empty >> object, so having that as the first member requires 7 bytes of >> padding. Re-ordering the members reduces the size of a non-char >> _BracketMatcher by 8 bytes (but it's still a whopping 96 bytes). >> >> (For a char _BracketMatcher the bitset cache makes it 128 bytes, >> this patch doesn't change that). > >This is OK too. Thanks, I'll clean the patch up and commit it soon.
On Mon, Apr 28, 2014 at 11:05 AM, Jonathan Wakely <jwakely@redhat.com> wrote: > There is a well-defined mapping from every unsigned char in the range > [0,255] to char and back, so conversions between char and unsigned > char are fine. If we used a larger type then we would get the wrong result > when char is signed, because (size_t)(char)-1 != (unsigned char)(char)-1 You are right. I've missed this. > sizeof(wchar_t) is 4 on unix-like systems, but a cache for char16_t > wouldn't be totally crazy. > > If you prefer I will not remove the uses of _CharT and _UnsignedCharT > and won't assume the cache is only used for char. There are still some > simplifications we can make. Yes, I (now again) prefer _UnsignedCharT. But it's just kind of personal flavor, I suppose? > Thanks, I'll clean the patch up and commit it soon. Thanks again :)
diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h index f5a198f..a9dd8d3 100644 --- a/libstdc++-v3/include/bits/regex_compiler.h +++ b/libstdc++-v3/include/bits/regex_compiler.h @@ -396,6 +396,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION void _M_ready() { + std::sort(_M_char_set.begin(), _M_char_set.end()); + auto __end = std::unique(_M_char_set.begin(), _M_char_set.end()); + _M_char_set.erase(__end, _M_char_set.end()); _M_make_cache(_IsChar()); #ifdef _GLIBCXX_DEBUG _M_is_ready = true; @@ -405,10 +408,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION private: typedef typename is_same<_CharT, char>::type _IsChar; struct _Dummy { }; + static constexpr size_t _S_cache_size() { return 1ul << __CHAR_BIT__; } typedef typename conditional<_IsChar::value, - std::bitset<1 << (8 * sizeof(_CharT))>, + std::bitset<_S_cache_size()>, _Dummy>::type _CacheT; - typedef typename make_unsigned<_CharT>::type _UnsignedCharT; private: bool @@ -416,14 +419,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION bool _M_apply(_CharT __ch, true_type) const - { return _M_cache[static_cast<_UnsignedCharT>(__ch)]; } + { return _M_cache[static_cast<unsigned char>(__ch)]; } void _M_make_cache(true_type) { - for (int __i = 0; __i < _M_cache.size(); __i++) - _M_cache[static_cast<_UnsignedCharT>(__i)] = - _M_apply(__i, false_type()); + for (unsigned __i = 0; __i < _S_cache_size(); __i++) + _M_cache[__i] = _M_apply(static_cast<char>(__i), false_type()); } void @@ -431,13 +433,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION { } private: - _CacheT _M_cache; std::vector<_CharT> _M_char_set; std::vector<_StringT> _M_equiv_set; std::vector<pair<_StrTransT, _StrTransT>> _M_range_set; _CharClassT _M_class_set; _TransT _M_translator; const _TraitsT& _M_traits; + _CacheT _M_cache; bool _M_is_non_matching; #ifdef _GLIBCXX_DEBUG bool _M_is_ready; diff --git a/libstdc++-v3/include/bits/regex_compiler.tcc b/libstdc++-v3/include/bits/regex_compiler.tcc index 128dac1..36edfba 100644 --- a/libstdc++-v3/include/bits/regex_compiler.tcc +++ b/libstdc++-v3/include/bits/regex_compiler.tcc @@ -507,12 +507,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _BracketMatcher<_TraitsT, __icase, __collate>:: _M_apply(_CharT __ch, false_type) const { - bool __ret = false; - if (std::find(_M_char_set.begin(), _M_char_set.end(), - _M_translator._M_translate(__ch)) - != _M_char_set.end()) - __ret = true; - else + bool __ret = std::binary_search(_M_char_set.begin(), _M_char_set.end(), + _M_translator._M_translate(__ch)); + if (!__ret) { auto __s = _M_translator._M_transform(__ch); for (auto& __it : _M_range_set)