From patchwork Tue May 7 13:49:08 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jonathan Wakely
X-Patchwork-Id: 1932486
From: Jonathan Wakely
To:
 libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: [committed] libstdc++: Fix handling of incomplete UTF-8 sequences in _Unicode_view
Date: Tue, 7 May 2024 14:49:08 +0100
Message-ID: <20240507134914.3820922-1-jwakely@redhat.com>

Tested x86_64-linux. Pushed to trunk. gcc-14 backport to follow.

-- >8 --

Eddie Nolan reported to me that _Unicode_view was not correctly
implementing the substitution of ill-formed subsequences with U+FFFD,
due to failing to increment the counter when the iterator reaches the
end of the sequence before a multibyte sequence is complete. As a
result, the incomplete sequence was not completely consumed, and then
the remaining character was treated as another ill-formed sequence,
giving two U+FFFD characters instead of one.

To avoid similar mistakes in future, this change introduces a lambda
that increments the iterator and the counter together. This ensures the
counter is always incremented when the iterator is incremented, so that
we always know how many characters have been consumed.

libstdc++-v3/ChangeLog:

	* include/bits/unicode.h (_Unicode_view::_M_read_utf8): Ensure
	count of characters consumed is correct when the end of the
	input is reached unexpectedly.
	* testsuite/ext/unicode/view.cc: Test incomplete UTF-8 sequences.
---
 libstdc++-v3/include/bits/unicode.h        | 24 ++++++++++------------
 libstdc++-v3/testsuite/ext/unicode/view.cc |  7 +++++++
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/libstdc++-v3/include/bits/unicode.h b/libstdc++-v3/include/bits/unicode.h
index 29813b743dc..46238143fb6 100644
--- a/libstdc++-v3/include/bits/unicode.h
+++ b/libstdc++-v3/include/bits/unicode.h
@@ -261,9 +261,13 @@ namespace __unicode
       {
         _Guard<_Iter> __g{this, _M_curr()};
         char32_t __c{};
-        uint8_t __u = *_M_curr()++;
         const uint8_t __lo_bound = 0x80, __hi_bound = 0xBF;
+        uint8_t __u = *_M_curr()++;
         uint8_t __to_incr = 1;
+        auto __incr = [&, this] {
+          ++__to_incr;
+          return ++_M_curr();
+        };
 
         if (__u <= 0x7F) [[likely]] // 0x00 to 0x7F
           __c = __u;
@@ -281,8 +285,7 @@ namespace __unicode
             else
               {
                 __c = (__c << 6) | (__u & 0x3F);
-                ++_M_curr();
-                ++__to_incr;
+                __incr();
               }
           }
         else if (__u <= 0xEF) // 0xE0 to 0xEF
@@ -295,11 +298,10 @@ namespace __unicode
 
             if (__u < __lo_bound_2 || __u > __hi_bound_2) [[unlikely]]
               __c = _S_error();
-            else if (++_M_curr() == _M_last) [[unlikely]]
+            else if (__incr() == _M_last) [[unlikely]]
               __c = _S_error();
             else
               {
-                ++__to_incr;
                 __c = (__c << 6) | (__u & 0x3F);
                 __u = *_M_curr();
 
@@ -308,8 +310,7 @@ namespace __unicode
                 else
                   {
                     __c = (__c << 6) | (__u & 0x3F);
-                    ++_M_curr();
-                    ++__to_incr;
+                    __incr();
                   }
               }
           }
@@ -323,21 +324,19 @@ namespace __unicode
 
             if (__u < __lo_bound_2 || __u > __hi_bound_2) [[unlikely]]
               __c = _S_error();
-            else if (++_M_curr() == _M_last) [[unlikely]]
+            else if (__incr() == _M_last) [[unlikely]]
               __c = _S_error();
             else
               {
-                ++__to_incr;
                 __c = (__c << 6) | (__u & 0x3F);
                 __u = *_M_curr();
 
                 if (__u < __lo_bound || __u > __hi_bound) [[unlikely]]
                   __c = _S_error();
-                else if (++_M_curr() == _M_last) [[unlikely]]
+                else if (__incr() == _M_last) [[unlikely]]
                   __c = _S_error();
                 else
                   {
-                    ++__to_incr;
                     __c = (__c << 6) | (__u & 0x3F);
                     __u = *_M_curr();
 
@@ -346,8 +345,7 @@ namespace __unicode
                     else
                       {
                         __c = (__c << 6) | (__u & 0x3F);
-                        ++_M_curr();
-                        ++__to_incr;
+                        __incr();
                       }
                   }
               }
diff --git a/libstdc++-v3/testsuite/ext/unicode/view.cc b/libstdc++-v3/testsuite/ext/unicode/view.cc
index ee23b0b1d8a..6f3c099bd84 100644
--- a/libstdc++-v3/testsuite/ext/unicode/view.cc
+++ b/libstdc++-v3/testsuite/ext/unicode/view.cc
@@ -55,6 +55,13 @@ test_illformed_utf8()
   VERIFY( std::ranges::equal(v5, u8"\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\x41\uFFFD\uFFFD\x42"sv) );
   uc::_Utf8_view v6("\xe1\x80\xe2\xf0\x91\x92\xf1\xbf\x41"sv); // Table 3-11
   VERIFY( std::ranges::equal(v6, u8"\uFFFD\uFFFD\uFFFD\uFFFD\x41"sv) );
+
+  uc::_Utf32_view v7("\xe1\x80"sv);
+  VERIFY( std::ranges::equal(v7, U"\uFFFD"sv) );
+  uc::_Utf32_view v8("\xf1\x80"sv);
+  VERIFY( std::ranges::equal(v8, U"\uFFFD"sv) );
+  uc::_Utf32_view v9("\xf1\x80\x80"sv);
+  VERIFY( std::ranges::equal(v9, U"\uFFFD"sv) );
 }
 
 constexpr void
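
For readers who want to see the counting problem from the commit message in isolation, here is a rough standalone sketch. It is not the libstdc++ code: count_replacements, its pre_patch flag and the restriction to ASCII plus three-byte sequences with lead bytes 0xE1..0xEC are all assumptions made for illustration. The pre_patch path mimics the old pattern of advancing the cursor without bumping the consumed count when the input ends early.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper for illustration; not part of libstdc++.
// Returns how many U+FFFD substitutions a simplified decoder would emit.
// Assumption: only ASCII and three-byte sequences with lead bytes
// 0xE1..0xEC are decoded; every other byte counts as ill-formed.
constexpr std::size_t
count_replacements(std::string_view in, bool pre_patch)
{
  const char* const last = in.data() + in.size();
  std::size_t errors = 0;
  for (std::size_t pos = 0; pos < in.size(); )
    {
      const char* cur = in.data() + pos;
      std::size_t consumed = 1;
      const auto u = static_cast<unsigned char>(*cur++);

      // Advance the cursor and the consumed count together.
      auto incr = [&] { ++consumed; return ++cur; };
      // True if the byte at the cursor is a continuation byte.
      auto is_cont = [&] {
        const auto c = static_cast<unsigned char>(*cur);
        return c >= 0x80 && c <= 0xBF;
      };

      bool ok = true;
      if (u > 0x7F)
        {
          if (u < 0xE1 || u > 0xEC || cur == last)
            ok = false;
          else if (!is_cont())
            ok = false;
          else if ((pre_patch ? ++cur : incr()) == last)
            ok = false; // pre-patch: cursor moved, count did not
          else
            {
              if (pre_patch)
                ++consumed; // old code counted only on this path
              if (!is_cont())
                ok = false;
              else
                incr();
            }
        }

      if (!ok)
        ++errors;
      pos += consumed; // resume after the bytes counted as consumed
    }
  return errors;
}
```

With the truncated input "\xe1\x80", the pre_patch path reports two ill-formed subsequences and the paired-increment path reports one, which matches the behaviour described in the commit message for this simplified model.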