
c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]

Message ID Zpg/3AiW41ccEeKL@tucnak
State New

Commit Message

Jakub Jelinek July 17, 2024, 10:04 p.m. UTC
Hi!

The following patch implements the easy parts of the paper.
When @$` are added to the basic character set, it means that
R"@$`()@$`" should now be valid (here I've noticed most of the
raw string tests were tested solely with -std=c++11 or -std=gnu++11
and I've tried to change that), and on the other side even if
by extension $ is allowed in identifiers, \u0024 or \U00000024
or \u{24} should not be, similarly how \u0041 is not allowed.
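
As a rough illustration of the raw string side (a sketch only, not the
libcpp code; raw_delim_char_ok and its cxx26 parameter are made-up names,
and ASCII is assumed), the characters acceptable in a raw string delimiter
could be modelled like this:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not a libcpp function: returns true if C may appear
// in a raw string delimiter.  CXX26 selects the rules after P2558R2.
static bool
raw_delim_char_ok (char c, bool cxx26)
{
  if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
      || (c >= '0' && c <= '9'))
    return true;
  // Punctuation of the C++23 basic character set, minus ( ) and backslash,
  // which can never be part of a delimiter.
  static const char punct[] = "_{}[]#<>%:;.?*+-/^&|~!=,\"'";
  if (c != '\0' && std::strchr (punct, c))
    return true;
  // The three characters added to the basic character set in C++26.
  return cxx26 && (c == '$' || c == '@' || c == '`');
}

// Usage example: R"@$`()@$`" needs all three new characters.
static void
raw_delim_examples ()
{
  assert (raw_delim_char_ok ('@', true) && !raw_delim_char_ok ('@', false));
  assert (raw_delim_char_ok ('$', true) && raw_delim_char_ok ('`', true));
  assert (!raw_delim_char_ok ('(', true) && !raw_delim_char_ok (' ', true));
}
```

The single cxx26 flag stands in for the combined "C++ and newer than C++23"
condition; in C mode the three characters stay invalid in delimiters.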

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

The paper in 3.1 claims though that
#include <stdio.h>

#define STR(x) #x

int main()
{
  printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
}
should have been accepted before this paper (and rejected after it),
but g++ rejects it.

I've tried to understand it, but am confused on what is the right
behavior and why.

Consider
#define STR(x) #x
const char *a = "\u00b7";
const char *b = STR(\u00b7);
const char *c = "\u0041";
const char *d = STR(\u0041);
const char *e = STR(a\u00b7);
const char *f = STR(a\u0041);
const char *g = STR(a \u00b7);
const char *h = STR(a \u0041);
const char *i = "\u066d";
const char *j = STR(\u066d);
const char *k = "\u0040";
const char *l = STR(\u0040);
const char *m = STR(a\u066d);
const char *n = STR(a\u0040);
const char *o = STR(a \u066d);
const char *p = STR(a \u0040);

Neither clang nor gcc emits any diagnostics on the a, c, i and k
initializers; those are certainly valid (c is invalid in C23 though).  g++
with -pedantic-errors emits errors on all the others, while clang++ does so
on the ones with STR involving \u0041, \u0040 and a\u066d.  The chosen values
are \u0040 '@' as something being changed by this paper, \u0041 'A' as a basic
character set char valid in identifiers before/after, \u00b7 as an example
of a character which is pedantically valid in identifiers if not at the start,
and \u066d as something pedantically not valid in identifiers.

Now, https://eel.is/c++draft/lex.charset#6 says that UCN used outside of a
string/character literal which corresponds to basic character set character
(or control character) is ill-formed, that would make d, f, h cases invalid
for C++ and l, n, p cases invalid for C++26.
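
To make that rule concrete, here is a small sketch of my reading of it
(in_basic_charset and ucn_ill_formed_outside_literal are hypothetical names
for illustration, not libcpp functions; the code point ranges are the 96
members of the C++23 basic character set plus the three added by the paper):

```cpp
#include <cassert>

// True if code point CP is in the basic character set.  CXX26 selects the
// set after P2558R2, which adds U+0024 '$', U+0040 '@' and U+0060 '`'.
static bool
in_basic_charset (char32_t cp, bool cxx26)
{
  // Tab, new-line, vertical tab, form feed and space.
  if (cp == 0x09 || cp == 0x0a || cp == 0x0b || cp == 0x0c || cp == 0x20)
    return true;
  if (cp < 0x21 || cp > 0x7e)
    return false;
  if (cp == 0x24 || cp == 0x40 || cp == 0x60)
    return cxx26;
  return true;
}

// True if a UCN designating CP, used outside a character or string
// literal, names a control character or a basic character set member.
static bool
ucn_ill_formed_outside_literal (char32_t cp, bool cxx26)
{
  bool control = cp <= 0x1f || (cp >= 0x7f && cp <= 0x9f);
  return control || in_basic_charset (cp, cxx26);
}

// Usage example matching the cases above.
static void
charset_examples ()
{
  // d, f, h: \u0041 is rejected both before and after the paper.
  assert (ucn_ill_formed_outside_literal (0x41, false));
  // l, n, p: \u0040 is only caught by this rule in C++26.
  assert (!ucn_ill_formed_outside_literal (0x40, false));
  assert (ucn_ill_formed_outside_literal (0x40, true));
}
```

This rule alone says nothing about \u00b7 or \u066d; those are only
affected by the identifier and preprocessing-token rules.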

https://eel.is/c++draft/lex.name states which characters can appear at the
start of the identifier and which can appear after the start.  And
https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
either identifier, or tons of other things, or "each non-whitespace
character that cannot be one of the above"

Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
invalid if the preprocessing token is being converted into token.

And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
the basic character set matches the last category, the program is
ill-formed."

Now, e.g.  for the C++23 STR(\u0040) case, \u0040 is there not in the basic
character set, so valid outside of the literals (not the case anymore in
C++26), but it isn't nondigit and doesn't have XID_Start property, so it
isn't IMHO an identifier and so must be the "each non-whitespace character
that cannot be one of the above" case.  Why doesn't the above mentioned
https://eel.is/c++draft/lex.pptoken#2 sentence make that invalid?  Ignoring
that, I'd say it would then be stringized, and that feels like what
clang++ is doing.  Now, e.g.  for the STR(a\u066d) case, I wonder why that
isn't lexed as an identifier followed by a \u066d "each non-whitespace
character that cannot be one of the above" token and stringized similarly;
clang++ rejects that.
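
To show what I mean by that reading, here is a toy splitter (purely
illustrative; toy_xid_kind and toy_split are made-up names, the
classification only knows ASCII letters, digits, '_' and U+00B7 rather than
the real XID_Start/XID_Continue tables, and pp-numbers and all other token
kinds are ignored).  It is not what either compiler does:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum xid_kind { XID_NONE, XID_CONTINUE_ONLY, XID_START };

// Toy stand-in for the Unicode property lookup.
static xid_kind
toy_xid_kind (char32_t c)
{
  if ((c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_')
    return XID_START;
  if ((c >= U'0' && c <= U'9') || c == 0xb7)
    return XID_CONTINUE_ONLY;
  return XID_NONE;
}

// Split already-translated characters into maximal identifiers; every
// other non-space character becomes a single-character token of its own.
static std::vector<std::u32string>
toy_split (const std::u32string &in)
{
  std::vector<std::u32string> toks;
  std::size_t i = 0;
  while (i < in.size ())
    {
      if (in[i] == U' ')
        {
          ++i;
          continue;
        }
      std::size_t start = i++;
      if (toy_xid_kind (in[start]) == XID_START)
        while (i < in.size () && toy_xid_kind (in[i]) != XID_NONE)
          ++i;
      toks.push_back (in.substr (start, i - start));
    }
  return toks;
}

// Usage example: a\u066d splits in two, a\u00b7 stays one identifier.
static void
split_examples ()
{
  assert (toy_split (U"a\u066d").size () == 2);
  assert (toy_split (U"a\u00b7").size () == 1);
}
```

Under this reading a\u066d would yield the identifier a plus a separate
one-character token, which the # operator could then stringize.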

What GCC libcpp seems to be doing is that forms_identifier_p calls
_cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells whether it is
the first or a later character in an identifier, and e.g.  _cpp_valid_ucn
then for UCNs valid in string literals does
  else if (identifier_pos)
    {
      int validity = ucn_valid_in_identifier (pfile, result, nst);
          
      if (validity == 0)
        cpp_error (pfile, CPP_DL_ERROR,
                   "universal character %.*s is not valid in an identifier",
                   (int) (str - base), base);
      else if (validity == 2 && identifier_pos == 1)
        cpp_error (pfile, CPP_DL_ERROR,
   "universal character %.*s is not valid at the start of an identifier",
                   (int) (str - base), base);
    }
so basically all those invalid-in-identifier cases emit an error and are then
treated as if they were valid in identifiers.  That differs from what e.g.
_cpp_valid_utf8 does for C (but not for C++), and only for the chars
completely invalid in identifiers rather than those valid in identifiers but
not at the start:
          /* In C++, this is an error for invalid character in an identifier
             because logically, the UTF-8 was converted to a UCN during
             translation phase 1 (even though we don't physically do it that
             way).  In C, this byte rather becomes grammatically a separate
             token.  */
   
          if (CPP_OPTION (pfile, cplusplus))
            cpp_error (pfile, CPP_DL_ERROR,
                       "extended character %.*s is not valid in an identifier",
                       (int) (*pstr - base), base);
          else
            {
              *pstr = base;
              return false;
            }
The comment doesn't really match what is done in recent C++ versions because
there UCNs are translated to characters and not the other way around.

2024-07-17  Jakub Jelinek  <jakub@redhat.com>

	PR c++/110343
libcpp/
	* lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
	(lex_raw_string): For C++26 allow $@` characters in prefix.
	* charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
gcc/testsuite/
	* c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-2.c: Likewise.
	* c-c++-common/raw-string-4.c: Likewise.
	* c-c++-common/raw-string-5.c: Likewise.  Expect some diagnostics
	only for non-c++26, for c++26 expect different.
	* c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-11.c: Likewise.
	* c-c++-common/raw-string-13.c: Likewise.
	* c-c++-common/raw-string-14.c: Likewise.
	* c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
	change c++ specific dg-options to just -Wtrigraphs.
	* c-c++-common/raw-string-16.c: Likewise.
	* c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
	remove -std=c++11 from c++ specific dg-options.
	* c-c++-common/raw-string-19.c: Likewise.
	* g++.dg/cpp26/raw-string1.C: New test.
	* g++.dg/cpp26/raw-string2.C: New test.


	Jakub

Comments

Jason Merrill July 25, 2024, 6:35 p.m. UTC | #1
On 7/17/24 6:04 PM, Jakub Jelinek wrote:
> Hi!
> 
> The following patch implements the easy parts of the paper.
> When @$` are added to the basic character set, it means that
> R"@$`()@$`" should now be valid (here I've noticed most of the
> raw string tests were tested solely with -std=c++11 or -std=gnu++11
> and I've tried to change that), and on the other side even if
> by extension $ is allowed in identifiers, \u0024 or \U00000024
> or \u{24} should not be, similarly how \u0041 is not allowed.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> The paper in 3.1 claims though that
> #include <stdio.h>
> 
> #define STR(x) #x
> 
> int main()
> {
>    printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
> }
> should have been accepted before this paper (and rejected after it),
> but g++ rejects it.
> 
> I've tried to understand it, but am confused on what is the right
> behavior and why.
> 
> Consider
> #define STR(x) #x
> const char *a = "\u00b7";
> const char *b = STR(\u00b7);
> const char *c = "\u0041";
> const char *d = STR(\u0041);
> const char *e = STR(a\u00b7);
> const char *f = STR(a\u0041);
> const char *g = STR(a \u00b7);
> const char *h = STR(a \u0041);
> const char *i = "\u066d";
> const char *j = STR(\u066d);
> const char *k = "\u0040";
> const char *l = STR(\u0040);
> const char *m = STR(a\u066d);
> const char *n = STR(a\u0040);
> const char *o = STR(a \u066d);
> const char *p = STR(a \u0040);
> 
> Neither clang nor gcc emit any diagnostics on the a, c, i and k
> initializers, those are certainly valid (c is invalid in C23 though).  g++
> emits with -pedantic-errors errors on all the others, while clang++ on the
> ones with STR involving \u0041, \u0040 and a\u066d.  The chosen values are
> \u0040 '@' as something being changed by this paper, \u0041 'A' as basic
> character set char valid in identifiers before/after, \u00b7 as an example
> of character which is pedantically valid in identifiers if not at the start
> and \u066d as something pedantically not valid in identifiers.
> 
> Now, https://eel.is/c++draft/lex.charset#6 says that UCN used outside of a
> string/character literal which corresponds to basic character set character
> (or control character) is ill-formed, that would make d, f, h cases invalid
> for C++ and l, n, p cases invalid for C++26.
> 
> https://eel.is/c++draft/lex.name states which characters can appear at the
> start of the identifier and which can appear after the start.  And
> https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
> either identifier, or tons of other things, or "each non-whitespace
> character that cannot be one of the above"
> 
> Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
> invalid if the preprocessing token is being converted into token.
> 
> And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
> the basic character set matches the last category, the program is
> ill-formed."
> 
> Now, e.g.  for the C++23 STR(\u0040) case, \u0040 is there not in the basic
> character set, so valid outside of the literals (not the case anymore in
> C++26), but it isn't nondigit and doesn't have XID_Start property, so it
> isn't IMHO an identifier and so must be the "each non-whitespace character
> that cannot be one of the above" case.  Why doesn't the above mentioned
> https://eel.is/c++draft/lex.pptoken#2 sentence make that invalid?

Your argument makes sense to me, though...

> Ignoring
> that, I'd say it would be then stringized and that feels like it is what
> clang++ is doing.  Now, e.g.  for the STR(a\u066d) case, I wonder why that
> isn't lexed as an identifier followed by \u066d "each non-whitespace
> character that cannot be one of the above" token and stringified similarly,
> clang++ rejects that.
> 
> What GCC libcpp seems to be doing is that if that forms_identifier_p calls
> _cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells it is first
> or second+ in identifier, and e.g.  _cpp_valid_ucn then for UCNs valid in
> string literals calls
>    else if (identifier_pos)
>      {
>        int validity = ucn_valid_in_identifier (pfile, result, nst);
>            
>        if (validity == 0)
>          cpp_error (pfile, CPP_DL_ERROR,
>                     "universal character %.*s is not valid in an identifier",
>                     (int) (str - base), base);
>        else if (validity == 2 && identifier_pos == 1)
>          cpp_error (pfile, CPP_DL_ERROR,
>     "universal character %.*s is not valid at the start of an identifier",
>                     (int) (str - base), base);
>      }
> so basically all those invalid in identifiers cases emit an error and
> pretend to be valid in identifiers, rather than what e.g.  _cpp_valid_utf8
> does for C but not for C++ and only for the chars completely invalid in
> identifiers rather than just valid in identifiers but not at the start:
>            /* In C++, this is an error for invalid character in an identifier
>               because logically, the UTF-8 was converted to a UCN during
>               translation phase 1 (even though we don't physically do it that
>               way).  In C, this byte rather becomes grammatically a separate
>               token.  */
>     
>            if (CPP_OPTION (pfile, cplusplus))
>              cpp_error (pfile, CPP_DL_ERROR,
>                         "extended character %.*s is not valid in an identifier",
>                         (int) (*pstr - base), base);
>            else
>              {
>                *pstr = base;
>                return false;
>              }
> The comment doesn't really match what is done in recent C++ versions because
> there UCNs are translated to characters and not the other way around.

...it seems wrong that calling forms_identifier_p gives an error and 
returns true for characters that can't be part of an identifier, which I 
would expect to produce a false result.  If we want to complain about 
the pptoken#2 issue, that seems like it should happen in the CPP_OTHER 
section of _cpp_lex_direct.

Our diagnostic for STR(\u0041) is similarly unhelpful, saying just "not 
valid in an identifier" rather than anything about the basic character 
set or that it should be spelled "A".

But if we're going to give an error either way, fixing this seems a low 
priority.

> 2024-07-17  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR c++/110343
> libcpp/
> 	* lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
> 	(lex_raw_string): For C++26 allow $@` characters in prefix.
> 	* charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
> gcc/testsuite/
> 	* c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
> 	remove c++ specific dg-options.
> 	* c-c++-common/raw-string-2.c: Likewise.
> 	* c-c++-common/raw-string-4.c: Likewise.
> 	* c-c++-common/raw-string-5.c: Likewise.  Expect some diagnostics
> 	only for non-c++26, for c++26 expect different.
> 	* c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
> 	remove c++ specific dg-options.
> 	* c-c++-common/raw-string-11.c: Likewise.
> 	* c-c++-common/raw-string-13.c: Likewise.
> 	* c-c++-common/raw-string-14.c: Likewise.
> 	* c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
> 	change c++ specific dg-options to just -Wtrigraphs.
> 	* c-c++-common/raw-string-16.c: Likewise.
> 	* c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
> 	remove c++ specific dg-options.
> 	* c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
> 	remove -std=c++11 from c++ specific dg-options.
> 	* c-c++-common/raw-string-19.c: Likewise.
> 	* g++.dg/cpp26/raw-string1.C: New test.
> 	* g++.dg/cpp26/raw-string2.C: New test.
> 
> --- libcpp/lex.cc.jj	2024-07-17 11:36:49.897873247 +0200
> +++ libcpp/lex.cc	2024-07-17 20:04:43.936793506 +0200
> @@ -2718,7 +2718,10 @@ lex_raw_string (cpp_reader *pfile, cpp_t
>   		       || c == '*' || c == '+' || c == '-' || c == '/'
>   		       || c == '^' || c == '&' || c == '|' || c == '~'
>   		       || c == '!' || c == '=' || c == ','
> -		       || c == '"' || c == '\''))
> +		       || c == '"' || c == '\''
> +		       || ((c == '$' || c == '@' || c == '`')
> +			   && CPP_OPTION (pfile, cplusplus)
> +			   && CPP_OPTION (pfile, lang) > CLK_CXX23)))
>   	    prefix[prefix_len++] = c;
>   	  else
>   	    {
> --- libcpp/charset.cc.jj	2024-01-05 08:35:13.696827331 +0100
> +++ libcpp/charset.cc	2024-07-17 20:18:13.665467035 +0200
> @@ -1808,7 +1808,12 @@ _cpp_valid_ucn (cpp_reader *pfile, const
>         result = 1;
>       }
>     else if (identifier_pos && result == 0x24
> -	   && CPP_OPTION (pfile, dollars_in_ident))
> +	   && CPP_OPTION (pfile, dollars_in_ident)
> +	   /* In C++26 when dollars are allowed in identifiers,
> +	      we should still reject \u0024 as $ is part of the basic
> +	      character set.  */
> +	   && !(CPP_OPTION (pfile, cplusplus)
> +		&& CPP_OPTION (pfile, lang) > CLK_CXX23))

I wonder about moving $ handling into the next else, so we don't need to 
worry about the basic charset here?

But the patch is OK.

Jason
Jason Merrill July 26, 2024, 3:43 p.m. UTC | #2
I'm now seeing a -std=c++26 failure on g++.dg/cpp/ucn-1.C.

Jason
Jakub Jelinek July 26, 2024, 3:55 p.m. UTC | #3
On Fri, Jul 26, 2024 at 11:43:13AM -0400, Jason Merrill wrote:
> I'm now seeing a -std=c++26 failure on g++.dg/cpp/ucn-1.C.

I don't remember seeing it when I wrote the patch, but today I see it as
well.

The following patch seems to fix that, tested on i686-linux, ok for trunk?

2024-07-26  Jakub Jelinek  <jakub@redhat.com>

	* g++.dg/cpp/ucn-1.C (main): Expect error on c\u0024c identifier also
	for C++26.

--- gcc/testsuite/g++.dg/cpp/ucn-1.C.jj	2020-01-14 20:02:46.702611047 +0100
+++ gcc/testsuite/g++.dg/cpp/ucn-1.C	2024-07-26 17:52:33.881518790 +0200
@@ -9,7 +9,7 @@ int main()
 
   int c\u0041c;		// { dg-error "not valid in an identifier" }
 		// $ is OK on most targets; not part of basic source char set
-  int c\u0024c;	// { dg-error "not valid in an identifier" "" { target { powerpc-ibm-aix* } } }
+  int c\u0024c;	// { dg-error "not valid in an identifier" "" { target { { powerpc-ibm-aix* } || c++26 } } }
 
   U"\uD800";		  // { dg-error "not a valid universal character" }
 


	Jakub
Jason Merrill July 26, 2024, 5:25 p.m. UTC | #4
On 7/26/24 11:55 AM, Jakub Jelinek wrote:
> On Fri, Jul 26, 2024 at 11:43:13AM -0400, Jason Merrill wrote:
>> I'm now seeing a -std=c++26 failure on g++.dg/cpp/ucn-1.C.
> 
> I don't remember seeing it when I wrote the patch, but today I see it as
> well.
> 
> The following patch seems to fix that, tested on i686-linux, ok for trunk?

OK.

> 2024-07-26  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* g++.dg/cpp/ucn-1.C (main): Expect error on c\u0024c identifier also
> 	for C++26.
> 
> --- gcc/testsuite/g++.dg/cpp/ucn-1.C.jj	2020-01-14 20:02:46.702611047 +0100
> +++ gcc/testsuite/g++.dg/cpp/ucn-1.C	2024-07-26 17:52:33.881518790 +0200
> @@ -9,7 +9,7 @@ int main()
>   
>     int c\u0041c;		// { dg-error "not valid in an identifier" }
>   		// $ is OK on most targets; not part of basic source char set
> -  int c\u0024c;	// { dg-error "not valid in an identifier" "" { target { powerpc-ibm-aix* } } }
> +  int c\u0024c;	// { dg-error "not valid in an identifier" "" { target { { powerpc-ibm-aix* } || c++26 } } }
>   
>     U"\uD800";		  // { dg-error "not a valid universal character" }
>   
> 
> 
> 	Jakub
>
Patch

--- libcpp/lex.cc.jj	2024-07-17 11:36:49.897873247 +0200
+++ libcpp/lex.cc	2024-07-17 20:04:43.936793506 +0200
@@ -2718,7 +2718,10 @@  lex_raw_string (cpp_reader *pfile, cpp_t
 		       || c == '*' || c == '+' || c == '-' || c == '/'
 		       || c == '^' || c == '&' || c == '|' || c == '~'
 		       || c == '!' || c == '=' || c == ','
-		       || c == '"' || c == '\''))
+		       || c == '"' || c == '\''
+		       || ((c == '$' || c == '@' || c == '`')
+			   && CPP_OPTION (pfile, cplusplus)
+			   && CPP_OPTION (pfile, lang) > CLK_CXX23)))
 	    prefix[prefix_len++] = c;
 	  else
 	    {
--- libcpp/charset.cc.jj	2024-01-05 08:35:13.696827331 +0100
+++ libcpp/charset.cc	2024-07-17 20:18:13.665467035 +0200
@@ -1808,7 +1808,12 @@  _cpp_valid_ucn (cpp_reader *pfile, const
       result = 1;
     }
   else if (identifier_pos && result == 0x24 
-	   && CPP_OPTION (pfile, dollars_in_ident))
+	   && CPP_OPTION (pfile, dollars_in_ident)
+	   /* In C++26 when dollars are allowed in identifiers,
+	      we should still reject \u0024 as $ is part of the basic
+	      character set.  */
+	   && !(CPP_OPTION (pfile, cplusplus)
+		&& CPP_OPTION (pfile, lang) > CLK_CXX23))
     {
       if (CPP_OPTION (pfile, warn_dollars) && !pfile->state.skipping)
 	{
--- gcc/testsuite/c-c++-common/raw-string-1.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-1.c	2024-07-17 20:31:02.272652757 +0200
@@ -1,7 +1,6 @@ 
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-2.c.jj	2020-01-12 11:54:37.023404206 +0100
+++ gcc/testsuite/c-c++-common/raw-string-2.c	2024-07-17 20:31:18.415446546 +0200
@@ -1,7 +1,6 @@ 
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-4.c.jj	2020-01-12 11:54:37.023404206 +0100
+++ gcc/testsuite/c-c++-common/raw-string-4.c	2024-07-17 20:31:51.590022777 +0200
@@ -1,7 +1,6 @@ 
 // R is not applicable for character literals.
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const int	i0	= R'a';	// { dg-error "was not declared|undeclared" "undeclared" }
 		// { dg-error "expected ',' or ';'" "expected" { target c } .-1 }
--- gcc/testsuite/c-c++-common/raw-string-5.c.jj	2020-07-28 15:39:09.992756448 +0200
+++ gcc/testsuite/c-c++-common/raw-string-5.c	2024-07-17 20:56:46.522822013 +0200
@@ -1,6 +1,5 @@ 
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const void *s0 = R"0123456789abcdefg()0123456789abcdefg" 0;
 	// { dg-error "raw string delimiter longer" "longer" { target *-*-* } .-1 }
@@ -15,12 +14,18 @@  const void *s3 = R")())" 0;
 	// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
 	// { dg-error "stray" "stray" { target *-*-* } .-2 }
 const void *s4 = R"@()@" 0;
-	// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
-	// { dg-error "stray" "stray" { target *-*-* } .-2 }
+	// { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+	// { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+	// { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
 const void *s5 = R"$()$" 0;
-	// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
-	// { dg-error "stray" "stray" { target *-*-* } .-2 }
-const void *s6 = R"\u0040()\u0040" 0;
+	// { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+	// { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+	// { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s6 = R"`()`" 0;
+	// { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+	// { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+	// { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s7 = R"\u0040()\u0040" 0;
 	// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
 	// { dg-error "stray" "stray" { target *-*-* } .-2 }
 
--- gcc/testsuite/c-c++-common/raw-string-6.c.jj	2020-12-28 12:27:32.500752614 +0100
+++ gcc/testsuite/c-c++-common/raw-string-6.c	2024-07-17 20:32:26.193580759 +0200
@@ -1,6 +1,5 @@ 
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const void *s0 = R"ouch()ouCh"; 	// { dg-error "unterminated raw string" "unterminated" }
 // { dg-error "at end of input" "end" { target *-*-* } .-1 }
--- gcc/testsuite/c-c++-common/raw-string-11.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-11.c	2024-07-17 20:33:54.236456112 +0200
@@ -1,7 +1,7 @@ 
 // PR preprocessor/48740
+// { dg-do run { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -trigraphs -save-temps" { target c } }
-// { dg-options "-std=c++0x -save-temps" { target c++ } }
-// { dg-do run }
+// { dg-options "-save-temps" { target c++ } }
 
 int main ()
 {
@@ -9,4 +9,3 @@  int main ()
 			   "foo%sbar%sfred%sbob?""?""?""?""?",
 			   sizeof ("foo%sbar%sfred%sbob?""?""?""?""?"));
 }
-
--- gcc/testsuite/c-c++-common/raw-string-13.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-13.c	2024-07-17 20:34:23.669080145 +0200
@@ -1,8 +1,7 @@ 
 // PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-14.c.jj	2020-07-28 15:39:09.992756448 +0200
+++ gcc/testsuite/c-c++-common/raw-string-14.c	2024-07-17 20:34:43.507826727 +0200
@@ -1,7 +1,6 @@ 
 // PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
 
 const void *s0 = R"abc\
 def()abcdef" 0;
--- gcc/testsuite/c-c++-common/raw-string-15.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-15.c	2024-07-17 20:34:58.994628892 +0200
@@ -1,8 +1,8 @@ 
 // PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-16.c.jj	2020-07-28 15:39:09.992756448 +0200
+++ gcc/testsuite/c-c++-common/raw-string-16.c	2024-07-17 20:35:22.387330085 +0200
@@ -1,7 +1,7 @@ 
 // PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
 
 const void *s0 = R"abc\
 def()abcdef" 0;
--- gcc/testsuite/c-c++-common/raw-string-17.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-17.c	2024-07-17 20:35:36.497149845 +0200
@@ -1,7 +1,6 @@ 
 /* PR preprocessor/57824 */
-/* { dg-do run } */
+/* { dg-do run { target { c || c++11 } } } */
 /* { dg-options "-std=gnu99" { target c } } */
-/* { dg-options "-std=c++11" { target c++ } } */
 
 #define S(s) s
 #define T(s) s "\n"
--- gcc/testsuite/c-c++-common/raw-string-18.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-18.c	2024-07-17 20:35:55.151911555 +0200
@@ -1,7 +1,7 @@ 
 /* PR preprocessor/57824 */
-/* { dg-do compile } */
+/* { dg-do compile { target { c || c++11 } } } */
 /* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno" { target c++ } } */
 
 const char x[] = R"(
 abc
--- gcc/testsuite/c-c++-common/raw-string-19.c.jj	2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-19.c	2024-07-17 20:36:25.445524589 +0200
@@ -1,7 +1,7 @@ 
 /* PR preprocessor/57824 */
-/* { dg-do compile } */
+// { dg-do compile { target { c || c++11 } } }
 /* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno -save-temps" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno -save-temps" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno -save-temps" { target c++ } } */
 
 const char x[] = R"(
 abc
--- gcc/testsuite/g++.dg/cpp26/raw-string1.C.jj	2024-07-17 20:46:06.878052479 +0200
+++ gcc/testsuite/g++.dg/cpp26/raw-string1.C	2024-07-17 20:47:50.761715122 +0200
@@ -0,0 +1,4 @@ 
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target c++26 } }
+
+const char *s0 = R"`@$$@`@`$()`@$$@`@`$";
--- gcc/testsuite/g++.dg/cpp26/raw-string2.C.jj	2024-07-17 20:54:53.478273235 +0200
+++ gcc/testsuite/g++.dg/cpp26/raw-string2.C	2024-07-17 20:58:46.177289931 +0200
@@ -0,0 +1,7 @@ 
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target { ! { avr*-*-* mmix*-*-* *-*-aix* } } } }
+// { dg-options "" }
+
+int a$b;
+int a\u0024c;		// { dg-error "universal character \\\\u0024 is not valid in an identifier" "" { target c++26 } }
+int a\U00000024d;	// { dg-error "universal character \\\\U00000024 is not valid in an identifier" "" { target c++26 } }