Message ID | 20240404153158.313297-1-jwakely@redhat.com |
---|---|
State | New |
Headers | show |
Series | libstdc++: Fix infinite loop in std::istream::ignore(n, delim) [PR93672] | expand |
> On 4 Apr 2024, at 16:29, Jonathan Wakely <jwakely@redhat.com> wrote: > > I would appreciate more eyes on this to confirm my conclusions about > negative int_type values, and the proposed fix, make sense. > > Tested x86_64-linux. > > -- >8 -- > > A negative value for the delim value passed to std::istream::ignore can > never match any character in the stream, because the comparison is done > using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never > returns negative values (except at EOF). The optimized version of > ignore for the std::istream specialization uses traits_type::find to > locate the delim character in the streambuf, which _can_ match a > negative delim on platforms where char is signed, but then we do another > comparison using eq_int_type which fails. The code then keeps looping > forever, with traits_type::find saying the character is present and > eq_int_type saying it's not. > > A possible fix would be to check with eq_int_type after a successful > find, to see whether we really have a match. However, that would be > suboptimal since we know that a negative delimiter will never match > using eq_int_type. So a better fix is to adjust the check at the top of > the function that handles delim==eof(), so that we treat all negative > delim values as equivalent to EOF. That way we don't bother using find > to search for something that will never match with eq_int_type. Is the corollary to this that a platform with signed chars can never use a negative value as a delimiter - since that we always be treated as EOF? - I am not sure it there’s an actual use-case where that matters, but, Iain > > The version of ignore in the primary template doesn't need a change, > because it doesn't use traits_type::find, instead characters are > extracted one-by-one and always matched using eq_int_type. That avoids > the inconsistency between find and eq_int_type. > > libstdc++-v3/ChangeLog: > > PR libstdc++/93672 > * src/c++98/istream.cc (istream::ignore(streamsize, int_type)): > Treat all negative delimiter values as eof(). > * testsuite/27_io/basic_istream/ignore/char/93672.cc: New test. > --- > libstdc++-v3/src/c++98/istream.cc | 5 ++++- > .../27_io/basic_istream/ignore/char/93672.cc | 15 +++++++++++++++ > 2 files changed, 19 insertions(+), 1 deletion(-) > create mode 100644 libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc > > diff --git a/libstdc++-v3/src/c++98/istream.cc b/libstdc++-v3/src/c++98/istream.cc > index 07ac739c26a..aa1069dea07 100644 > --- a/libstdc++-v3/src/c++98/istream.cc > +++ b/libstdc++-v3/src/c++98/istream.cc > @@ -112,7 +112,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > basic_istream<char>:: > ignore(streamsize __n, int_type __delim) > { > - if (traits_type::eq_int_type(__delim, traits_type::eof())) > + // sgetc() returns either (int_type)(unsigned char)c or -1 for EOF. > + // If __delim is negative, then eq_int_type(sgetc(), __delim) can only > + // be true for EOF, so just treat all negative values as eof(). > + if (__delim < 0) > return ignore(__n); > > _M_gcount = 0; > diff --git a/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc > new file mode 100644 > index 00000000000..6d11f5622c8 > --- /dev/null > +++ b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc > @@ -0,0 +1,15 @@ > +// { dg-do run } > + > +#include <sstream> > +#include <testsuite_hooks.h> > + > +int main() > +{ > + std::istringstream in("x\xfdxxx\xfex"); > + in.ignore(10, std::char_traits<char>::to_int_type('\xfd')); > + VERIFY( in.gcount() == 2 ); > + VERIFY( ! in.eof() ); > + in.ignore(10, '\xfe'); > + VERIFY( in.gcount() == 5 ); > + VERIFY( in.eof() ); > +} > -- > 2.44.0 >
On Thu, Apr 4, 2024 at 5:29 PM Jonathan Wakely <jwakely@redhat.com> wrote: > I would appreciate more eyes on this to confirm my conclusions about > negative int_type values, and the proposed fix, make sense. The way something like this is handled in glibc's ctype functions is that both branches are considered. For isXXX(c) whether c is -v or 256-v the same value is returned (except for EOF which is -1). This caused the least number of bad surprises. You could here also perform similar actions.
On Thu, 4 Apr 2024 at 16:40, Iain Sandoe <idsandoe@googlemail.com> wrote: > > > > > On 4 Apr 2024, at 16:29, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > I would appreciate more eyes on this to confirm my conclusions about > > negative int_type values, and the proposed fix, make sense. > > > > Tested x86_64-linux. > > > > -- >8 -- > > > > A negative value for the delim value passed to std::istream::ignore can > > never match any character in the stream, because the comparison is done > > using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never > > returns negative values (except at EOF). The optimized version of > > ignore for the std::istream specialization uses traits_type::find to > > locate the delim character in the streambuf, which _can_ match a > > negative delim on platforms where char is signed, but then we do another > > comparison using eq_int_type which fails. The code then keeps looping > > forever, with traits_type::find saying the character is present and > > eq_int_type saying it's not. > > > > A possible fix would be to check with eq_int_type after a successful > > find, to see whether we really have a match. However, that would be > > suboptimal since we know that a negative delimiter will never match > > using eq_int_type. So a better fix is to adjust the check at the top of > > the function that handles delim==eof(), so that we treat all negative > > delim values as equivalent to EOF. That way we don't bother using find > > to search for something that will never match with eq_int_type. > > Is the corollary to this that a platform with signed chars can never use a > negative value as a delimiter - since that we always be treated as EOF? That's what the C++ standard says (and is what libc++ does). The delimiter argument to ignore is an int_type, not a char. So formally you should call it like: std::cin.ignore(n, std::istream::traits_type::to_int_type('a')); where to_int_type will cast to unsigned char and then to int, so that no char can ever produce a negative value for that argument. If you happen to know that casting 'a' to unsigned char and then to int doesn't change its value (because it's a 7-bit ASCII value), then you can be lazy and do: std::cin.ignore(n, 'a'); That works fine. But if your delimiter character is the MSB set, *and* char is signed on your platform, then you can't be lazy. The implicit conversion from char to the stream's int_type is not the same as the result of calling traits_type::to_int_type, and so these are NOT equivalent on a platform with signed char: std::cin.ignore(n, '\x80'); std::cin.ignore(n, (unsigned char)'\x80'); The former is wrong, the latter is correct. The former will never match a '\x80' in the stream, because the ignore function will cast each char extracted from the stream to (int)(unsigned char) and so never match -128. So the change to treat all negative values as EOF is just an optimization. Since they can never match, there's no point searching for them. Just skip n chars. > > - I am not sure it there’s an actual use-case where that matters, but, > Iain > > > > > The version of ignore in the primary template doesn't need a change, > > because it doesn't use traits_type::find, instead characters are > > extracted one-by-one and always matched using eq_int_type. That avoids > > the inconsistency between find and eq_int_type. > > > > libstdc++-v3/ChangeLog: > > > > PR libstdc++/93672 > > * src/c++98/istream.cc (istream::ignore(streamsize, int_type)): > > Treat all negative delimiter values as eof(). > > * testsuite/27_io/basic_istream/ignore/char/93672.cc: New test. > > --- > > libstdc++-v3/src/c++98/istream.cc | 5 ++++- > > .../27_io/basic_istream/ignore/char/93672.cc | 15 +++++++++++++++ > > 2 files changed, 19 insertions(+), 1 deletion(-) > > create mode 100644 libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc > > > > diff --git a/libstdc++-v3/src/c++98/istream.cc b/libstdc++-v3/src/c++98/istream.cc > > index 07ac739c26a..aa1069dea07 100644 > > --- a/libstdc++-v3/src/c++98/istream.cc > > +++ b/libstdc++-v3/src/c++98/istream.cc > > @@ -112,7 +112,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > > basic_istream<char>:: > > ignore(streamsize __n, int_type __delim) > > { > > - if (traits_type::eq_int_type(__delim, traits_type::eof())) > > + // sgetc() returns either (int_type)(unsigned char)c or -1 for EOF. > > + // If __delim is negative, then eq_int_type(sgetc(), __delim) can only > > + // be true for EOF, so just treat all negative values as eof(). > > + if (__delim < 0) > > return ignore(__n); > > > > _M_gcount = 0; > > diff --git a/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc > > new file mode 100644 > > index 00000000000..6d11f5622c8 > > --- /dev/null > > +++ b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc > > @@ -0,0 +1,15 @@ > > +// { dg-do run } > > + > > +#include <sstream> > > +#include <testsuite_hooks.h> > > + > > +int main() > > +{ > > + std::istringstream in("x\xfdxxx\xfex"); > > + in.ignore(10, std::char_traits<char>::to_int_type('\xfd')); > > + VERIFY( in.gcount() == 2 ); > > + VERIFY( ! in.eof() ); > > + in.ignore(10, '\xfe'); > > + VERIFY( in.gcount() == 5 ); > > + VERIFY( in.eof() ); > > +} > > -- > > 2.44.0 > > >
On Thu, 4 Apr 2024 at 17:29, Ulrich Drepper <drepper.fsp@gmail.com> wrote: > > On Thu, Apr 4, 2024 at 5:29 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > I would appreciate more eyes on this to confirm my conclusions about > > negative int_type values, and the proposed fix, make sense. > > The way something like this is handled in glibc's ctype functions is > that both branches are considered. For isXXX(c) whether c is -v or > 256-v the same value is returned (except for EOF which is -1). This > caused the least number of bad surprises. > > You could here also perform similar actions. Yes, my first attempt to fix PR93672 did exactly that, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93672#c1 But since it doesn't work for '\xff' (because that's EOF when char is signed) it only handles 127 of the 128 possible bugs ;-) I'm also not sure it's conforming, since the standard specifies how the matching is done, and that won't match negative chars.
On Thu, 4 Apr 2024 at 17:33, Jonathan Wakely <jwakely@redhat.com> wrote: > > On Thu, 4 Apr 2024 at 17:29, Ulrich Drepper <drepper.fsp@gmail.com> wrote: > > > > On Thu, Apr 4, 2024 at 5:29 PM Jonathan Wakely <jwakely@redhat.com> wrote: > > > I would appreciate more eyes on this to confirm my conclusions about > > > negative int_type values, and the proposed fix, make sense. > > > > The way something like this is handled in glibc's ctype functions is > > that both branches are considered. For isXXX(c) whether c is -v or > > 256-v the same value is returned (except for EOF which is -1). This > > caused the least number of bad surprises. > > > > You could here also perform similar actions. > > Yes, my first attempt to fix PR93672 did exactly that, see > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93672#c1 > > But since it doesn't work for '\xff' (because that's EOF when char is > signed) it only handles 127 of the 128 possible bugs ;-) > I'm also not sure it's conforming, since the standard specifies how > the matching is done, and that won't match negative chars. I might suggest relaxing the standard to allow it though...
On Thu, 4 Apr 2024 at 17:30, Jonathan Wakely <jwakely@redhat.com> wrote: > > On Thu, 4 Apr 2024 at 16:40, Iain Sandoe <idsandoe@googlemail.com> wrote: > > > > > > > > > On 4 Apr 2024, at 16:29, Jonathan Wakely <jwakely@redhat.com> wrote: > > > > > > I would appreciate more eyes on this to confirm my conclusions about > > > negative int_type values, and the proposed fix, make sense. > > > > > > Tested x86_64-linux. > > > > > > -- >8 -- > > > > > > A negative value for the delim value passed to std::istream::ignore can > > > never match any character in the stream, because the comparison is done > > > using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never > > > returns negative values (except at EOF). The optimized version of > > > ignore for the std::istream specialization uses traits_type::find to > > > locate the delim character in the streambuf, which _can_ match a > > > negative delim on platforms where char is signed, but then we do another > > > comparison using eq_int_type which fails. The code then keeps looping > > > forever, with traits_type::find saying the character is present and > > > eq_int_type saying it's not. > > > > > > A possible fix would be to check with eq_int_type after a successful > > > find, to see whether we really have a match. However, that would be > > > suboptimal since we know that a negative delimiter will never match > > > using eq_int_type. So a better fix is to adjust the check at the top of > > > the function that handles delim==eof(), so that we treat all negative > > > delim values as equivalent to EOF. That way we don't bother using find > > > to search for something that will never match with eq_int_type. > > > > Is the corollary to this that a platform with signed chars can never use a > > negative value as a delimiter - since that we always be treated as EOF? > > That's what the C++ standard says (and is what libc++ does). > > The delimiter argument to ignore is an int_type, not a char. So > formally you should call it like: > > std::cin.ignore(n, std::istream::traits_type::to_int_type('a')); > > where to_int_type will cast to unsigned char and then to int, so that > no char can ever produce a negative value for that argument. > > If you happen to know that casting 'a' to unsigned char and then to > int doesn't change its value (because it's a 7-bit ASCII value), then > you can be lazy and do: > > std::cin.ignore(n, 'a'); > > That works fine. > > But if your delimiter character is the MSB set, *and* char is signed > on your platform, then you can't be lazy. The implicit conversion from > char to the stream's int_type is not the same as the result of calling > traits_type::to_int_type, and so these are NOT equivalent on a > platform with signed char: > std::cin.ignore(n, '\x80'); > std::cin.ignore(n, (unsigned char)'\x80'); > > The former is wrong, the latter is correct. > The former will never match a '\x80' in the stream, because the ignore > function will cast each char extracted from the stream to > (int)(unsigned char) and so never match -128. > > So the change to treat all negative values as EOF is just an > optimization. Since they can never match, there's no point searching > for them. Just skip n chars. And FWIW, I don't think libc++ does that optimization. They extract char-by-char and compare them, even though that can never match a negative value. So they get the same result as with my patch, but much slower ;-) The status quo is that libstdc++ loops forever given a negative delimiter. That's obviously wrong! We need to fix the buggy logic that loops forever, and the "correct" standard conforming fix means that we never match the negative value, because the chars are extracted from the stream as non-negative int_type values. As an optimization, we can skip the loop that keeps trying impossible matches, because we know it won't match. That's what my patch does. (It occurs to me now that we could also treat any delim value greater than 255 as EOF too, since that can't match either). Ulrich's suggestion is to allow the buggy user code to Just Work (for all cases except (char)-1 on a platform where char is signed). That is maybe the least surprising behaviour for users. On the other hand, I'm not sure we really want to support: cin.ignore(n, -10); cin.ignore(n, -999); cin.ignore(n, +999); What are these trying to do? Those are just nonsense arguments to this function. But if we try to make the testcase in the bug report Just Work, we end up also accepting (at least) the -10 case, because it's indistinguishable from the char '\xf6'. Depending how we do it, we might also make the other cases work, treating -999 as '\x19', and +999 as '\xe7'.
On Thu, 4 Apr 2024 at 17:55, Jonathan Wakely <jwakely@redhat.com> wrote: > > On Thu, 4 Apr 2024 at 17:30, Jonathan Wakely <jwakely@redhat.com> wrote: > Ulrich's suggestion is to allow the buggy user code to Just Work (for > all cases except (char)-1 on a platform where char is signed). That is > maybe the least surprising behaviour for users. > > On the other hand, I'm not sure we really want to support: > > cin.ignore(n, -10); > cin.ignore(n, -999); > cin.ignore(n, +999); > > What are these trying to do? Those are just nonsense arguments to this > function. But if we try to make the testcase in the bug report Just > Work, we end up also accepting (at least) the -10 case, because it's > indistinguishable from the char '\xf6'. Depending how we do it, we > might also make the other cases work, treating -999 as '\x19', and > +999 as '\xe7'. So maybe what we want is to add a new overload: basic_istream& ignore(streamsize n, char_type c) { return ignore(n, traits_type::to_int_type(c)); } This will only accept the stream's char_type, not -999 or +999, and will do the required conversion to int_type correctly, not as an implicit conversion. Calls that pass eof() will still select the other overload and do the right thing. Calls using a char (or for wstream, a wchar_t) will select the new overload. This new overload will make some calls ambiguous, e.g. ignore(1, 1L) compiles today, but would become ambiguous. That seems fine, since the second argument should not be type long (what is it even trying to do?) If that's a problem, we can make it a constrained template that only gets called for the exact char_type: basic_istream& ignore(streamsize n, same_as<char_type> auto c) This would still Do The Right Thing for ignore(n, '\x80') but would not change the result of ignore(1, 1L) which would still select the original overload taking int_type for the second parameter. I think I'll raise this idea with the committee. For now though, I think my patch fixes the bug and conforms to the standard, and doesn't do anything inventive.
diff --git a/libstdc++-v3/src/c++98/istream.cc b/libstdc++-v3/src/c++98/istream.cc index 07ac739c26a..aa1069dea07 100644 --- a/libstdc++-v3/src/c++98/istream.cc +++ b/libstdc++-v3/src/c++98/istream.cc @@ -112,7 +112,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION basic_istream<char>:: ignore(streamsize __n, int_type __delim) { - if (traits_type::eq_int_type(__delim, traits_type::eof())) + // sgetc() returns either (int_type)(unsigned char)c or -1 for EOF. + // If __delim is negative, then eq_int_type(sgetc(), __delim) can only + // be true for EOF, so just treat all negative values as eof(). + if (__delim < 0) return ignore(__n); _M_gcount = 0; diff --git a/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc new file mode 100644 index 00000000000..6d11f5622c8 --- /dev/null +++ b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc @@ -0,0 +1,15 @@ +// { dg-do run } + +#include <sstream> +#include <testsuite_hooks.h> + +int main() +{ + std::istringstream in("x\xfdxxx\xfex"); + in.ignore(10, std::char_traits<char>::to_int_type('\xfd')); + VERIFY( in.gcount() == 2 ); + VERIFY( ! in.eof() ); + in.ignore(10, '\xfe'); + VERIFY( in.gcount() == 5 ); + VERIFY( in.eof() ); +}