Message ID | 46160747.VyTTuRmQVZ@polaris |
---|---|
State | New |
Headers | show |
Series | [libcpp] Issue a pedantic warning for UCNs outside UCS codespace | expand |
* Eric Botcazou: > the Universal Character Names accepted by the C family of compilers > are mapped to those of ISO/IEC 10646, which defines the Universal > Character Set codespace as the range 0-0x10FFFF inclusive. The > upper bound is already enforced for identifiers but not for > literals, so the following code is accepted in C99: > > #include <stddef.h> > > wchar_t a = L'\U00110000'; > > whereas it is rejected with an error by other compilers (Clang, MSVC). > > I'm not sure whether the compiler is really equired to issue a diagnostic in > this case. Moreover a few tests in the testsuite manipulate UCNs outside the > UCS codespace. That's why I suggest issuing a pedantic warning. > > Tested on x86_64-suse-linux, OK for the mainline? Since this is a pedantic warning … I think this has to depend on the C standards version. I think each C standard needs to be read against the edition of ISO 10646 current at the time of standards approval (the references are sadly not versioned, so the version is implied). Early versions of ISO 10646 definitely do not have the codespace restriction you mention.
> I think this has to depend on the C standards version. I think each C > standard needs to be read against the edition of ISO 10646 current at > the time of standards approval (the references are sadly not > versioned, so the version is implied). Early versions of ISO 10646 > definitely do not have the codespace restriction you mention. Note the already existing hardcoded check in ucn_valid_in_identifier though.
On Tue, 24 Sep 2019, Florian Weimer wrote: > I think this has to depend on the C standards version. I think each C > standard needs to be read against the edition of ISO 10646 current at > the time of standards approval (the references are sadly not > versioned, so the version is implied). Early versions of ISO 10646 > definitely do not have the codespace restriction you mention. Undated references aren't implicitly dated to the version when the standard was published. The ISO/IEC Directives, Part 2 (Principles and rules for the structure and drafting of ISO and IEC documents) (2018 edition, subclause 10.4) <https://www.iso.org/sites/directives/current/part2/index.xhtml#_idTextAnchor122> say: Undated references may be made: * only to a complete document; * if it will be possible to use all future changes of the referenced document for the purposes of the referring document; * when it is understood that the reference will include all amendments to and revisions of the referenced document. I think that's clear that the latest version at the time the standard is used applies (so if the document in the undated normative reference is revised, that effectively changes the requirements of the standad version referencing it).
On Tue, 24 Sep 2019, Eric Botcazou wrote: > Hi, > > the Universal Character Names accepted by the C family of compilers are mapped > to those of ISO/IEC 10646, which defines the Universal Character Set codespace > as the range 0-0x10FFFF inclusive. The upper bound is already enforced for > identifiers but not for literals, so the following code is accepted in C99: > > #include <stddef.h> > > wchar_t a = L'\U00110000'; > > whereas it is rejected with an error by other compilers (Clang, MSVC). > > I'm not sure whether the compiler is really equired to issue a diagnostic in > this case. Moreover a few tests in the testsuite manipulate UCNs outside the > UCS codespace. That's why I suggest issuing a pedantic warning. For C, I think such UCNs violate the Semantics but not the Constraints on UCNs, so no diagnostic is actually required in C, although it is permitted as a pedwarn / error. However, while C++ doesn't have that Semantics / Constraints division, it's also the case that before C++2a, C++ only has a dated normative reference to ISO/IEC 10646-1:1993 (C++2a adds an undated reference and says the dated one is only for deprecated features, as well as explicitly making such UCNs outside the ISO 10646 code point range ill-formed). So I think that for C++, this is only correct as an error / pedwarn in the C++2a case.
On Tue, 24 Sep 2019, Eric Botcazou wrote: > > I think this has to depend on the C standards version. I think each C > > standard needs to be read against the edition of ISO 10646 current at > > the time of standards approval (the references are sadly not > > versioned, so the version is implied). Early versions of ISO 10646 > > definitely do not have the codespace restriction you mention. > > Note the already existing hardcoded check in ucn_valid_in_identifier though. No C or C++ standard version allows characters outside that range in identifiers, so that check is independent of the ISO 10646 version being used.
> For C, I think such UCNs violate the Semantics but not the Constraints on > UCNs, so no diagnostic is actually required in C, although it is permitted > as a pedwarn / error. > > However, while C++ doesn't have that Semantics / Constraints division, > it's also the case that before C++2a, C++ only has a dated normative > reference to ISO/IEC 10646-1:1993 (C++2a adds an undated reference and > says the dated one is only for deprecated features, as well as explicitly > making such UCNs outside the ISO 10646 code point range ill-formed). So I > think that for C++, this is only correct as an error / pedwarn in the > C++2a case. OK, thanks for the exegesis. ;-) Revision version attached. 2019-09-26 Eric Botcazou <ebotcazou@adacore.com> libcpp/ * charset.c (UCS_LIMIT): New macro. (ucn_valid_in_identifier): Use it instead of a hardcoded constant. (_cpp_valid_ucn): Issue a pedantic warning for UCNs larger than UCS_LIMIT outside of identifiers in C and in C++2a. 2019-09-26 Eric Botcazou <ebotcazou@adacore.com> gcc/testsuite/ * gcc.dg/cpp/ucs.c: Add test for new warning and adjust. * gcc.dg/cpp/utf8-5byte-1.c: Add -w to the options. * gcc.dg/attr-alias-5.c: Likewise. * g++.dg/cpp/ucn-1.C: Add test for new warning. * g++.dg/cpp2a/ucn1.C: New test.
On Thu, 26 Sep 2019, Eric Botcazou wrote: > > For C, I think such UCNs violate the Semantics but not the Constraints on > > UCNs, so no diagnostic is actually required in C, although it is permitted > > as a pedwarn / error. > > > > However, while C++ doesn't have that Semantics / Constraints division, > > it's also the case that before C++2a, C++ only has a dated normative > > reference to ISO/IEC 10646-1:1993 (C++2a adds an undated reference and > > says the dated one is only for deprecated features, as well as explicitly > > making such UCNs outside the ISO 10646 code point range ill-formed). So I > > think that for C++, this is only correct as an error / pedwarn in the > > C++2a case. > > OK, thanks for the exegesis. ;-) Revision version attached. Checking "CPP_OPTION (pfile, lang) == CLK_CXX2A" is problematic because future versions later than C++2a should be handled the same as C++2a. The only place I see doing something similar (outside of init.c, most version conditionals are handled via language flags set there) does "CPP_OPTION (pfile, lang) > CLK_CXX11" (for "In C++14 and up these suffixes are in the standard library, so treat them as user-defined literals.", two places doing the same comparison). So I think that "CPP_OPTION (pfile, lang) > CLK_CXX17" is the right thing to replace the comparisons against CLK_CXX2A and CLK_GNUCXX2A. The patch is OK with that change.
Index: libcpp/charset.c =================================================================== --- libcpp/charset.c (revision 275988) +++ libcpp/charset.c (working copy) @@ -901,6 +901,9 @@ struct ucnrange { }; #include "ucnid.h" +/* ISO 10646 defines the UCS codespace as the range 0-0x10FFFF inclusive. */ +#define UCS_LIMIT 0x10FFFF + /* Returns 1 if C is valid in an identifier, 2 if C is valid except at the start of an identifier, and 0 if C is not valid in an identifier. We assume C has already gone through the checks of @@ -915,7 +918,7 @@ ucn_valid_in_identifier (cpp_reader *pfi int mn, mx, md; unsigned short valid_flags, invalid_start_flags; - if (c > 0x10FFFF) + if (c > UCS_LIMIT) return 0; mn = 0; @@ -1016,6 +1019,9 @@ ucn_valid_in_identifier (cpp_reader *pfi whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive. + If the hexadecimal value is larger than the upper bound of the UCS + codespace specified in ISO/IEC 10646, a pedantic warning is issued. + *PSTR must be preceded by "\u" or "\U"; it is assumed that the buffer end is delimited by a non-hex digit. Returns false if the UCN has not been consumed, true otherwise. @@ -1135,6 +1141,10 @@ _cpp_valid_ucn (cpp_reader *pfile, const "universal character %.*s is not valid at the start of an identifier", (int) (str - base), base); } + else if (result > UCS_LIMIT) + cpp_error (pfile, CPP_DL_PEDWARN, + "%.*s is outside the UCS codespace", + (int) (str - base), base); *cp = result; return true; Index: gcc/testsuite/gcc.dg/attr-alias-5.c =================================================================== --- gcc/testsuite/gcc.dg/attr-alias-5.c (revision 275988) +++ gcc/testsuite/gcc.dg/attr-alias-5.c (working copy) @@ -1,7 +1,7 @@ /* Verify diagnostics for aliases to strings containing extended identifiers or bad characters. */ /* { dg-do compile } */ -/* { dg-options "-std=gnu99" } */ +/* { dg-options "-std=gnu99 -w" } */ /* { dg-require-alias "" } */ /* { dg-require-ascii-locale "" } */ /* { dg-skip-if "" { powerpc*-*-aix* } } */ Index: gcc/testsuite/gcc.dg/cpp/ucs.c =================================================================== --- gcc/testsuite/gcc.dg/cpp/ucs.c (revision 275988) +++ gcc/testsuite/gcc.dg/cpp/ucs.c (working copy) @@ -39,7 +39,7 @@ #endif #if WCHAR_MAX >= 0x7ffffff -# if L'\U1234abcd' != 0x1234abcd +# if L'\U1234abcd' != 0x1234abcd /* { dg-warning "outside" "" } */ # error bad long ucs /* { dg-bogus "bad" "bad U1234abcd evaluation" } */ # endif #endif @@ -49,7 +49,7 @@ void foo () int c; c = L'\ubad'; /* { dg-error "incomplete" "incomplete UCN 1" } */ - c = L"\U1234"[0]; /* { dg-error "incomplete" "incompete UCN 2" } */ + c = L"\U1234"[0]; /* { dg-error "incomplete" "incomplete UCN 2" } */ c = L'\u000x'; /* { dg-error "incomplete" "non-hex digit in UCN" } */ /* If sizeof(HOST_WIDE_INT) > sizeof(wchar_t), we can get a multi-character @@ -64,4 +64,6 @@ void foo () c = '\u0025'; /* { dg-error "not a valid" "0025 invalid UCN" } */ c = L"\uD800"[0]; /* { dg-error "not a valid" "D800 invalid UCN" } */ c = L'\U0000DFFF'; /* { dg-error "not a valid" "DFFF invalid UCN" } */ + + c = L'\U00110000'; /* { dg-warning "outside" "110000 outside UCS" } */ } Index: gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c =================================================================== --- gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c (revision 275988) +++ gcc/testsuite/gcc.dg/cpp/utf8-5byte-1.c (working copy) @@ -1,7 +1,7 @@ /* Test for bug in conversions from 5-byte UTF-8 sequences in cpplib. */ /* { dg-do run { target { 4byte_wchar_t } } } */ -/* { dg-options "-std=gnu99" } */ +/* { dg-options "-std=gnu99 -w" } */ extern void abort (void); extern void exit (int);