diff mbox

[1/N] std::regex refactoring - _BracketMatcher

Message ID 20140428124003.GW928@redhat.com
State New
Headers show

Commit Message

Jonathan Wakely April 28, 2014, 12:40 p.m. UTC
Hi,

I've been looking through the regex code and have a few ideas for
simplifications or optimisations that I'd like to share.

This first patch is for _BracketMatcher. We only use std::bitset when
is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__
instead. We also only user _UnsignedCharT when is_same<_CharT, char>
so that can just be simplified to unsigned char.

The contents of _BracketMatcher::_M_char_set are not sorted and can
contain duplicates in the current code. Making that a sorted, unique
list in _BracketMatcher::_M_ready() allows a binary search instead of
linear search. This improves worst case performance for pathological
regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]")
but I'm not sure if it helps in the common case.

Finally, in the non-char case the _CacheT member is an unused empty
object, so having that as the first member requires 7 bytes of
padding. Re-ordering the members reduces the size of a non-char
_BracketMatcher by 8 bytes (but it's still a whopping 96 bytes).

(For a char _BracketMatcher the bitset cache makes it 128 bytes,
this patch doesn't change that).

Thoughts?

Comments

Tim Shen April 28, 2014, 2:15 p.m. UTC | #1
On Mon, Apr 28, 2014 at 8:40 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
> I've been looking through the regex code and have a few ideas for
> simplifications or optimisations that I'd like to share.

Thanks :)

> This first patch is for _BracketMatcher. We only use std::bitset when
> is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__
> instead. We also only user _UnsignedCharT when is_same<_CharT, char>
> so that can just be simplified to unsigned char.

Yes, since _UnsignedCharT is just used as indexes, we can always use a
larger unsigned integer instead. Maybe "size_t" is a better choice?

I'm not sure if we'll have a wchar_t cache (bitset<65536>) in the future ;)

> The contents of _BracketMatcher::_M_char_set are not sorted and can
> contain duplicates in the current code. Making that a sorted, unique
> list in _BracketMatcher::_M_ready() allows a binary search instead of
> linear search. This improves worst case performance for pathological
> regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]")
> but I'm not sure if it helps in the common case.

Trust me, in common case, even it's a bit slower, it won't be
observable. So it's just OK.

> Finally, in the non-char case the _CacheT member is an unused empty
> object, so having that as the first member requires 7 bytes of
> padding. Re-ordering the members reduces the size of a non-char
> _BracketMatcher by 8 bytes (but it's still a whopping 96 bytes).
>
> (For a char _BracketMatcher the bitset cache makes it 128 bytes,
> this patch doesn't change that).

This is OK too.

Rant: I can't understand why the committee doesn't keep things simple
by letting struct { } has 0 size. I used to implement my own tuple by
specializing tuple<> as an empty struct. The thing is, one tuple
shouldn't `has it` as a member, or it will get extra padding; at last
I used inheritance.
Jonathan Wakely April 28, 2014, 3:05 p.m. UTC | #2
On 28/04/14 10:15 -0400, Tim Shen wrote:
>On Mon, Apr 28, 2014 at 8:40 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
>> I've been looking through the regex code and have a few ideas for
>> simplifications or optimisations that I'd like to share.
>
>Thanks :)
>
>> This first patch is for _BracketMatcher. We only use std::bitset when
>> is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__
>> instead. We also only user _UnsignedCharT when is_same<_CharT, char>
>> so that can just be simplified to unsigned char.
>
>Yes, since _UnsignedCharT is just used as indexes, we can always use a
>larger unsigned integer instead. Maybe "size_t" is a better choice?

There is a well-defined mapping from every unsigned char in the range
[0,255] to char and back, so conversions between char and unsigned
char are fine.  If we used a larger type then we would get the wrong result
when char is signed, because (size_t)(char)-1 != (unsigned char)(char)-1

>I'm not sure if we'll have a wchar_t cache (bitset<65536>) in the future ;)

sizeof(wchar_t) is 4 on unix-like systems, but a cache for char16_t
wouldn't be totally crazy.

If you prefer I will not remove the uses of _CharT and _UnsignedCharT
and won't assume the cache is only used for char. There are still some
simplifications we can make.

>> The contents of _BracketMatcher::_M_char_set are not sorted and can
>> contain duplicates in the current code. Making that a sorted, unique
>> list in _BracketMatcher::_M_ready() allows a binary search instead of
>> linear search. This improves worst case performance for pathological
>> regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]")
>> but I'm not sure if it helps in the common case.
>
>Trust me, in common case, even it's a bit slower, it won't be
>observable. So it's just OK.
>
>> Finally, in the non-char case the _CacheT member is an unused empty
>> object, so having that as the first member requires 7 bytes of
>> padding. Re-ordering the members reduces the size of a non-char
>> _BracketMatcher by 8 bytes (but it's still a whopping 96 bytes).
>>
>> (For a char _BracketMatcher the bitset cache makes it 128 bytes,
>> this patch doesn't change that).
>
>This is OK too.

Thanks, I'll clean the patch up and commit it soon.
Tim Shen April 28, 2014, 3:18 p.m. UTC | #3
On Mon, Apr 28, 2014 at 11:05 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
> There is a well-defined mapping from every unsigned char in the range
> [0,255] to char and back, so conversions between char and unsigned
> char are fine.  If we used a larger type then we would get the wrong result
> when char is signed, because (size_t)(char)-1 != (unsigned char)(char)-1

You are right. I've missed this.

> sizeof(wchar_t) is 4 on unix-like systems, but a cache for char16_t
> wouldn't be totally crazy.
>
> If you prefer I will not remove the uses of _CharT and _UnsignedCharT
> and won't assume the cache is only used for char. There are still some
> simplifications we can make.

Yes, I (now again) prefer _UnsignedCharT. But it's just kind of
personal flavor, I suppose?

> Thanks, I'll clean the patch up and commit it soon.

Thanks again :)
diff mbox

Patch

diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h
index f5a198f..a9dd8d3 100644
--- a/libstdc++-v3/include/bits/regex_compiler.h
+++ b/libstdc++-v3/include/bits/regex_compiler.h
@@ -396,6 +396,9 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       void
       _M_ready()
       {
+	std::sort(_M_char_set.begin(), _M_char_set.end());
+	auto __end = std::unique(_M_char_set.begin(), _M_char_set.end());
+	_M_char_set.erase(__end, _M_char_set.end());
 	_M_make_cache(_IsChar());
 #ifdef _GLIBCXX_DEBUG
 	_M_is_ready = true;
@@ -405,10 +408,10 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     private:
       typedef typename is_same<_CharT, char>::type _IsChar;
       struct _Dummy { };
+      static constexpr size_t _S_cache_size() { return 1ul << __CHAR_BIT__; }
       typedef typename conditional<_IsChar::value,
-				   std::bitset<1 << (8 * sizeof(_CharT))>,
+				   std::bitset<_S_cache_size()>,
 				   _Dummy>::type _CacheT;
-      typedef typename make_unsigned<_CharT>::type _UnsignedCharT;
 
     private:
       bool
@@ -416,14 +419,13 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
       bool
       _M_apply(_CharT __ch, true_type) const
-      { return _M_cache[static_cast<_UnsignedCharT>(__ch)]; }
+      { return _M_cache[static_cast<unsigned char>(__ch)]; }
 
       void
       _M_make_cache(true_type)
       {
-	for (int __i = 0; __i < _M_cache.size(); __i++)
-	  _M_cache[static_cast<_UnsignedCharT>(__i)] =
-	    _M_apply(__i, false_type());
+	for (unsigned __i = 0; __i < _S_cache_size(); __i++)
+	  _M_cache[__i] = _M_apply(static_cast<char>(__i), false_type());
       }
 
       void
@@ -431,13 +433,13 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       { }
 
     private:
-      _CacheT                                   _M_cache;
       std::vector<_CharT>                       _M_char_set;
       std::vector<_StringT>                     _M_equiv_set;
       std::vector<pair<_StrTransT, _StrTransT>> _M_range_set;
       _CharClassT                               _M_class_set;
       _TransT                                   _M_translator;
       const _TraitsT&                           _M_traits;
+      _CacheT                                   _M_cache;
       bool                                      _M_is_non_matching;
 #ifdef _GLIBCXX_DEBUG
       bool                                      _M_is_ready;
diff --git a/libstdc++-v3/include/bits/regex_compiler.tcc b/libstdc++-v3/include/bits/regex_compiler.tcc
index 128dac1..36edfba 100644
--- a/libstdc++-v3/include/bits/regex_compiler.tcc
+++ b/libstdc++-v3/include/bits/regex_compiler.tcc
@@ -507,12 +507,9 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _BracketMatcher<_TraitsT, __icase, __collate>::
     _M_apply(_CharT __ch, false_type) const
     {
-      bool __ret = false;
-      if (std::find(_M_char_set.begin(), _M_char_set.end(),
-		    _M_translator._M_translate(__ch))
-	  != _M_char_set.end())
-	__ret = true;
-      else
+      bool __ret = std::binary_search(_M_char_set.begin(), _M_char_set.end(),
+				      _M_translator._M_translate(__ch));
+      if (!__ret)
 	{
 	  auto __s = _M_translator._M_transform(__ch);
 	  for (auto& __it : _M_range_set)