c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]

Hi!

The following patch implements the easy parts of the paper.
When @$` are added to the basic character set, it means that
R"@$`()@$`" should now be valid (here I've noticed most of the
raw string tests were tested solely with -std=c++11 or -std=gnu++11
and I've tried to change that), and on the other side even if
by extension $ is allowed in identifiers, \u0024 or \U00000024
or \u{24} should not be, similarly to how \u0041 is not allowed.
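
To illustrate the raw string part, here is a minimal sketch of the delimiter
check, assuming an ASCII execution environment.  The names
valid_raw_delim_char, valid_raw_delim and check_raw_delims, and the cxx26
flag, are hypothetical and are not what libcpp's lex_raw_string uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not libcpp code: may `c` appear in the delimiter
// (d-char-sequence) of a raw string literal?
// Assumption: `cxx26` selects the basic character set after P2558R2.
bool valid_raw_delim_char(char c, bool cxx26)
{
  // Never allowed in a delimiter: NUL, space, parentheses, backslash,
  // and the whitespace control characters.
  if (c == '\0' || std::strchr(" ()\\\t\v\f\n", c))
    return false;
  // Letters and digits (range checks assume ASCII).
  if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
      || (c >= '0' && c <= '9'))
    return true;
  // Remaining punctuation of the basic character set before P2558R2.
  if (std::strchr("_{}[]#<>%:;.?*+-/^&|~!=,\"'", c))
    return true;
  // The three characters added by P2558R2.
  return cxx26 && std::strchr("@$`", c) != nullptr;
}

// Checks a whole delimiter; it may be at most 16 characters long.
bool valid_raw_delim(const char *delim, bool cxx26)
{
  std::size_t len = std::strlen(delim);
  if (len > 16)
    return false;
  for (std::size_t i = 0; i < len; ++i)
    if (!valid_raw_delim_char(delim[i], cxx26))
      return false;
  return true;
}

// The delimiter from the description above, before and after the paper.
void check_raw_delims()
{
  assert(valid_raw_delim("@$`", true));
  assert(!valid_raw_delim("@$`", false));
  assert(valid_raw_delim("foo", false));
  assert(!valid_raw_delim("a b", true));  // space is never allowed
}
```

This only models which characters are acceptable; it says nothing about how
the lexer recovers once it sees an invalid delimiter.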

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

The paper in 3.1 claims though that
#include <stdio.h>

#define STR(x) #x

int main()
{
  printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
}
should have been accepted before this paper (and rejected after it),
but g++ rejects it.

I've tried to understand it, but am confused on what is the right
behavior and why.

Consider
#define STR(x) #x
const char *a = "\u00b7";
const char *b = STR(\u00b7);
const char *c = "\u0041";
const char *d = STR(\u0041);
const char *e = STR(a\u00b7);
const char *f = STR(a\u0041);
const char *g = STR(a \u00b7);
const char *h = STR(a \u0041);
const char *i = "\u066d";
const char *j = STR(\u066d);
const char *k = "\u0040";
const char *l = STR(\u0040);
const char *m = STR(a\u066d);
const char *n = STR(a\u0040);
const char *o = STR(a \u066d);
const char *p = STR(a \u0040);

Neither clang nor gcc emits any diagnostics on the a, c, i and k
initializers; those are certainly valid (c is invalid in C23 though).  With
-pedantic-errors, g++ emits errors on all the others, while clang++ does so
only on the ones with STR involving \u0041, \u0040 and a\u066d.  The chosen
values are \u0040 '@' as something being changed by this paper, \u0041 'A'
as a basic character set char valid in identifiers before/after, \u00b7 as
an example of a character which is pedantically valid in identifiers if not
at the start, and \u066d as something pedantically not valid in identifiers.

Now, https://eel.is/c++draft/lex.charset#6 says that a UCN used outside of a
string/character literal which corresponds to a basic character set character
(or control character) is ill-formed; that would make the d, f, h cases
invalid for C++ and the l, n, p cases invalid for C++26.
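
A minimal sketch of that rule alone, to make the C++23 vs. C++26 difference
concrete.  The function names and the cxx26 flag are hypothetical; this is
not libcpp code and it deliberately ignores whether the character is valid
in an identifier.

```cpp
#include <cassert>

// Hypothetical model of the [lex.charset] rule for a UCN that appears
// outside of a character or string literal.  Returns true if the rule does
// not make the program ill-formed.
// Assumption: `cxx26` selects the basic character set after P2558R2.
bool ucn_ok_outside_literal(char32_t cp, bool cxx26)
{
  // A UCN has to name a Unicode scalar value in the first place.
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;
  // Control characters: U+0000..U+001F and U+007F..U+009F.
  bool control = cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
  // Before P2558R2 the graphic part of the basic character set is
  // printable ASCII without $, @ and `; afterwards it is all of it.
  bool printable_ascii = cp >= 0x20 && cp <= 0x7E;
  bool added_by_p2558 = cp == 0x24 || cp == 0x40 || cp == 0x60;
  bool basic = printable_ascii && (cxx26 || !added_by_p2558);
  return !control && !basic;
}

// The four code points used in the initializers above.
void check_ucn_examples()
{
  assert(!ucn_ok_outside_literal(0x41, false));  // 'A': d, f, h
  assert(!ucn_ok_outside_literal(0x41, true));
  assert(ucn_ok_outside_literal(0x40, false));   // '@': l, n, p in C++23
  assert(!ucn_ok_outside_literal(0x40, true));   // '@': l, n, p in C++26
  assert(ucn_ok_outside_literal(0xB7, true));
  assert(ucn_ok_outside_literal(0x66D, true));
}
```

Passing this check does not mean the construct is valid; the identifier and
preprocessing-token rules discussed next still apply.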

https://eel.is/c++draft/lex.name states which characters can appear at the
start of the identifier and which can appear after the start.  And
https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
either identifier, or tons of other things, or "each non-whitespace
character that cannot be one of the above"

Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
invalid if the preprocessing token is being converted into token.

And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
the basic character set matches the last category, the program is
ill-formed."

Now, e.g.  for the C++23 STR(\u0040) case, \u0040 is there not in the basic
character set, so valid outside of the literals (not the case anymore in
C++26), but it isn't nondigit and doesn't have XID_Start property, so it
isn't IMHO an identifier and so must be the "each non-whitespace character
that cannot be one of the above" case.  Why doesn't the above mentioned
https://eel.is/c++draft/lex.pptoken#2 sentence make that invalid?  Ignoring
that, I'd say it would be then stringized and that feels like it is what
clang++ is doing.  Now, e.g.  for the STR(a\u066d) case, I wonder why that
isn't lexed as an identifier followed by a \u066d "each non-whitespace
character that cannot be one of the above" token and stringified similarly;
clang++ rejects that.

What GCC libcpp seems to be doing is that forms_identifier_p calls
_cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells whether the
character is first or second+ in the identifier, and e.g.  _cpp_valid_ucn
then, for UCNs valid in string literals, does
  else if (identifier_pos)
    {
      int validity = ucn_valid_in_identifier (pfile, result, nst);

      if (validity == 0)
        cpp_error (pfile, CPP_DL_ERROR,
                   "universal character %.*s is not valid in an identifier",
                   (int) (str - base), base);
      else if (validity == 2 && identifier_pos == 1)
        cpp_error (pfile, CPP_DL_ERROR,
   "universal character %.*s is not valid at the start of an identifier",
                   (int) (str - base), base);
    }
so basically all those invalid-in-identifiers cases emit an error and are
then treated as if they were valid in identifiers.  That differs from what
e.g.  _cpp_valid_utf8 does for C (but not for C++), and there only for the
chars completely invalid in identifiers rather than those valid in
identifiers but not at the start:
          /* In C++, this is an error for invalid character in an identifier
             because logically, the UTF-8 was converted to a UCN during
             translation phase 1 (even though we don't physically do it that
             way).  In C, this byte rather becomes grammatically a separate
             token.  */

          if (CPP_OPTION (pfile, cplusplus))
            cpp_error (pfile, CPP_DL_ERROR,
                       "extended character %.*s is not valid in an identifier",
                       (int) (*pstr - base), base);
          else
            {
              *pstr = base;
              return false;
            }
The comment doesn't really match what is done in recent C++ versions because
there UCNs are translated to characters and not the other way around.
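
The "error, but keep lexing it as part of the identifier" behavior of the
_cpp_valid_ucn branch quoted above can be sketched as a toy model.  The
Diag/Outcome types and the function names are hypothetical, and the meaning
of the validity and identifier_pos values is an assumption inferred from the
quoted snippet, not taken from the libcpp sources.

```cpp
#include <cassert>

// Toy model of the quoted _cpp_valid_ucn branch; not the libcpp code.
// Assumed encodings (inferred from the snippet and the text around it):
//   validity:       0 = not valid in identifiers, 1 = valid anywhere,
//                   2 = valid, but not as the first character
//   identifier_pos: 0 = not in an identifier, 1 = first, 2 = later
enum class Diag { None, NotValidInIdentifier, NotValidAtStart };

struct Outcome
{
  Diag diag;           // which error, if any, is reported
  bool kept_in_ident;  // does the UCN still become part of the identifier?
};

Outcome model_ucn_in_identifier(int validity, int identifier_pos)
{
  if (identifier_pos == 0)
    return { Diag::None, false };
  Diag d = Diag::None;
  if (validity == 0)
    d = Diag::NotValidInIdentifier;
  else if (validity == 2 && identifier_pos == 1)
    d = Diag::NotValidAtStart;
  // The error is only reported; lexing goes on as if the UCN were valid.
  return { d, true };
}

void check_model()
{
  assert(model_ucn_in_identifier(0, 2).diag == Diag::NotValidInIdentifier);
  assert(model_ucn_in_identifier(0, 2).kept_in_ident);
  assert(model_ucn_in_identifier(2, 1).diag == Diag::NotValidAtStart);
  assert(model_ucn_in_identifier(2, 2).diag == Diag::None);
}
```

The model covers only the quoted branch; diagnostics issued elsewhere
(including inside ucn_valid_in_identifier itself) are out of scope.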

2024-07-17  Jakub Jelinek  <jakub@redhat.com>

	PR c++/110343
libcpp/
	* lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
	(lex_raw_string): For C++26 allow $@` characters in prefix.
	* charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
gcc/testsuite/
	* c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-2.c: Likewise.
	* c-c++-common/raw-string-4.c: Likewise.
	* c-c++-common/raw-string-5.c: Likewise.  Expect some diagnostics
	only for non-c++26, for c++26 expect different.
	* c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-11.c: Likewise.
	* c-c++-common/raw-string-13.c: Likewise.
	* c-c++-common/raw-string-14.c: Likewise.
	* c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
	change c++ specific dg-options to just -Wtrigraphs.
	* c-c++-common/raw-string-16.c: Likewise.
	* c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
	remove c++ specific dg-options.
	* c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
	remove -std=c++11 from c++ specific dg-options.
	* c-c++-common/raw-string-19.c: Likewise.
	* g++.dg/cpp26/raw-string1.C: New test.
	* g++.dg/cpp26/raw-string2.C: New test.

	Jakub

Message ID	Zpg/3AiW41ccEeKL@tucnak
