From patchwork Mon Jul 3 15:39:11 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?b?0L3QsNCx?=
X-Patchwork-Id: 1802855
Date: Mon, 3 Jul 2023 17:39:11 +0200
To: Florian Weimer
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: [PATCH v16] POSIX locale covers every byte [BZ# 29511]
From: =?utf-8?b?0L3QsNCx?=

This largely duplicates the ASCII code with the error path changed.

There are two user-facing changes:
  * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
  * mbrtowc() and friends return b if b <= 0x7F else 0xDC00+b

Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
  (a) is 1-byte, stateless, and contains 256 characters
  (b) they collate in byte order
  (c) the first 128 characters are equivalent to ASCII (like previous)
cf.
https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard;
in short, this means that mbrtowc() must never fail and must return
  b if b <= 0x7F else ab+c for all bytes b
  where c is some constant >=0x80
    and a is a positive integer constant

By strategically picking c=0xDC00 we land at the same point of the Unicode
Low Surrogate Area at DC00-DCFF, described as
> Isolated surrogate code points have no interpretation;
> consequently, no character code charts or names lists
> are provided for this range.
as the Python UTF-8 errors=surrogateescape encoding.

Signed-off-by: Ahelenia Ziemiańska
---
Clean rebase.

 NEWS                                |   8 ++
 iconv/Makefile                      |   2 +-
 iconv/gconv_builtin.h               |   8 ++
 iconv/gconv_int.h                   |   8 ++
 iconv/gconv_posix.c                 |  94 ++++++++++++++++++
 iconv/tst-iconv_prog.sh             |  43 +++++++++
 iconvdata/tst-tables.sh             |   1 +
 inet/tst-idna_name_classify.c       |   6 +-
 locale/C_name.c                     |   2 +-
 locale/tst-C-locale.c               |  44 +++++++++
 localedata/charmaps/POSIX           | 136 ++++++++++++++++++++++++++
 localedata/locales/POSIX            | 143 +++++++++++++++++++++++++++-
 localedata/tst-c-utf8-consistency.c |  24 ++---
 stdio-common/Makefile               |   1 +
 stdio-common/tst-printf-bz25691.c   |   2 +
 wcsmbs/wcsmbsload.c                 |  14 +--
 16 files changed, 512 insertions(+), 24 deletions(-)
 create mode 100644 iconv/gconv_posix.c
 create mode 100644 localedata/charmaps/POSIX

diff --git a/NEWS b/NEWS
index 709ee40e50..ec2be7dcd0 100644
--- a/NEWS
+++ b/NEWS
@@ -48,6 +48,14 @@ Major new features:
 * The strlcpy and strlcat functions have been added.  They are derived
   from OpenBSD, and are expected to be added to a future POSIX version.
 
+* The default/"POSIX"/"C" locale's character set is now "POSIX",
+  instead of "ANSI_X3.4-1968" ‒ this is a new fully-reversible
+  8-bit transparent encoding for compatibility with POSIX Issue 7 TC 2,
+  identity-mapping bytes in the ASCII [0, 0x7F] range,
+  and mapping [0x80, 0xFF] bytes to [U+DC80, U+DCFF].
+  The standard now requires the "POSIX"/"C" locale to have an encoding
+  with these features ‒ 8-bit transparency and a continuous collation sequence.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * In the Linux kernel for the hppa/parisc architecture some of the
diff --git a/iconv/Makefile b/iconv/Makefile
index afb3fb7bdb..b61e130377 100644
--- a/iconv/Makefile
+++ b/iconv/Makefile
@@ -25,7 +25,7 @@ include ../Makeconfig
 headers = iconv.h gconv.h
 routines = iconv_open iconv iconv_close \
            gconv_open gconv gconv_close gconv_db gconv_conf \
-           gconv_builtin gconv_simple gconv_trans gconv_cache
+           gconv_builtin gconv_simple gconv_posix gconv_trans gconv_cache
 routines += gconv_dl gconv_charset
 
 vpath %.c ../locale/programs ../intl
diff --git a/iconv/gconv_builtin.h b/iconv/gconv_builtin.h
index 35608b4461..d2dcdd44a3 100644
--- a/iconv/gconv_builtin.h
+++ b/iconv/gconv_builtin.h
@@ -89,6 +89,14 @@ BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->ascii",
                         __gconv_transform_internal_ascii, NULL, 4, 4, 1, 1)
 
 
+BUILTIN_TRANSFORMATION ("POSIX//", "INTERNAL", 1, "=posix->INTERNAL",
+                        __gconv_transform_posix_internal, __gconv_btwoc_posix,
+                        1, 1, 4, 4)
+
+BUILTIN_TRANSFORMATION ("INTERNAL", "POSIX//", 1, "=INTERNAL->posix",
+                        __gconv_transform_internal_posix, NULL, 4, 4, 1, 1)
+
+
 #if BYTE_ORDER == BIG_ENDIAN
 BUILTIN_ALIAS ("UNICODEBIG//", "ISO-10646/UCS2/")
 BUILTIN_ALIAS ("UCS-2BE//", "ISO-10646/UCS2/")
diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h
index 19d042faff..3d0889b321 100644
--- a/iconv/gconv_int.h
+++ b/iconv/gconv_int.h
@@ -309,6 +309,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2)
 
 __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal);
 __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii);
+__BUILTIN_TRANSFORM (__gconv_transform_posix_internal);
+__BUILTIN_TRANSFORM (__gconv_transform_internal_posix);
 __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal);
 __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8);
 __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal);
@@ -327,6 +329,12 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal);
    only ASCII characters.  */
 extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned char c);
 
+/* Specialized conversion function for a single byte to INTERNAL,
+   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
+   of the Low Surrogate Area at [U+DC80, U+DCFF].  */
+extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned char c)
+  attribute_hidden;
+
 #endif
 
 __END_DECLS
diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c
new file mode 100644
index 0000000000..885929baca
--- /dev/null
+++ b/iconv/gconv_posix.c
@@ -0,0 +1,94 @@
+/* "POSIX" locale transformation functions.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+
+#include
+
+
+/* Specialized conversion function for a single byte to INTERNAL,
+   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
+   of the Low Surrogate Area at [U+DC80, U+DCFF].  */
+wint_t
+__gconv_btwoc_posix (struct __gconv_step *step, unsigned char c)
+{
+  if (c < 0x80)
+    return c;
+  else
+    return 0xdc00 + c;
+}
+
+
+/* Convert from {[0, 0x7F] => ISO 646-IRV; [0x80, 0xFF] => [U+DC80, U+DCFF]}
+   to the internal (UCS4-like) format.  */
+#define DEFINE_INIT 0
+#define DEFINE_FINI 0
+#define MIN_NEEDED_FROM 1
+#define MIN_NEEDED_TO 4
+#define FROM_DIRECTION 1
+#define FROM_LOOP posix_internal_loop
+#define TO_LOOP posix_internal_loop /* This is not used.  */
+#define FUNCTION_NAME __gconv_transform_posix_internal
+#define ONE_DIRECTION 1
+
+#define MIN_NEEDED_INPUT MIN_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO
+#define LOOPFCT FROM_LOOP
+#define BODY \
+  { \
+    if (__glibc_unlikely (*inptr > '\x7f')) \
+      *((uint32_t *) outptr) = 0xdc00 + *inptr++; \
+    else \
+      *((uint32_t *) outptr) = *inptr++; \
+    outptr += sizeof (uint32_t); \
+  }
+#include
+#include
+
+
+/* Convert from the internal (UCS4-like) format to
+   {ISO 646-IRV => [0, 0x7F]; [U+DC80, U+DCFF] => [0x80, 0xFF]}.  */
+#define DEFINE_INIT 0
+#define DEFINE_FINI 0
+#define MIN_NEEDED_FROM 4
+#define MIN_NEEDED_TO 1
+#define FROM_DIRECTION 1
+#define FROM_LOOP internal_posix_loop
+#define TO_LOOP internal_posix_loop /* This is not used.  */
+#define FUNCTION_NAME __gconv_transform_internal_posix
+#define ONE_DIRECTION 1
+
+#define MIN_NEEDED_INPUT MIN_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO
+#define LOOPFCT FROM_LOOP
+#define BODY \
+  { \
+    uint32_t val = *((const uint32_t *) inptr); \
+    if (__glibc_unlikely ((val > 0x7f && val < 0xdc80) || val > 0xdcff)) \
+      { \
+        UNICODE_TAG_HANDLER (val, 4); \
+        STANDARD_TO_LOOP_ERR_HANDLER (4); \
+      } \
+    else \
+      { \
+        *outptr++ = val & 0xff; \
+        inptr += sizeof (uint32_t); \
+      } \
+  }
+#define LOOP_NEED_FLAGS
+#include
+#include
diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh
index 76400cddfc..c757fb2c40 100644
--- a/iconv/tst-iconv_prog.sh
+++ b/iconv/tst-iconv_prog.sh
@@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do
   execute_test
   check_errtest_result
 done
+
+allbytes ()
+{
+  for (( i = 0; i <= 255; i++ )); do
+    printf '\'"$(printf "%o" "$i")"
+  done
+}
+
+allucs4be ()
+{
+  for (( i = 0; i <= 127; i++ )); do
+    printf '\0\0\0\'"$(printf "%o" "$i")"
+  done
+  for (( i = 128; i <= 255; i++ )); do
+    printf '\0\0\xdc\'"$(printf "%o" "$i")"
+  done
+}
+
+check_posix_result ()
+{
+  if [ $? -eq 0 ]; then
+    result=PASS
+  else
+    result=FAIL
+  fi
+
+  echo "$result: from \"$1\", to: \"$2\""
+
+  if [ "$result" != "PASS" ]; then
+    exit 1
+  fi
+}
+
+check_posix_encoding ()
+{
+  eval PROG=\"$ICONV\"
+  allbytes | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be)
+  check_posix_result POSIX UCS-4BE
+  allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes)
+  check_posix_result UCS-4BE POSIX
+}
+
+check_posix_encoding
diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh
index ddac85daa1..badce3e4ca 100755
--- a/iconvdata/tst-tables.sh
+++ b/iconvdata/tst-tables.sh
@@ -31,6 +31,7 @@ cat <
 #include
 #include
+#include
 #include
 #include
 #include
@@ -229,6 +230,49 @@ run_test (const char *locname)
   STRTEST (YESSTR, "");
   STRTEST (NOSTR, "");
 
+  for(int i = 0; i <= 0xff; ++i)
+    {
+      unsigned char bs[] = {i, 0};
+      mbstate_t ctx = {};
+      wchar_t wc = -1, exp = i <= 0x7f ? i : (0xdc00 + i);
+      size_t sz = mbrtowc(&wc, (char *) bs, 1, &ctx);
+      if (sz != !!i)
+        {
+          printf ("mbrtowc(%02hhx) width in locale %s wrong "
+                  "(is %zd, should be %d)\n", *bs, locname, sz, !!i);
+          result = 1;
+        }
+      if (wc != exp)
+        {
+          printf ("mbrtowc(%02hhx) value in locale %s wrong "
+                  "(is %x, should be %x)\n", *bs, locname, wc, exp);
+          result = 1;
+        }
+    }
+
+  for (int i = 0; i <= 0xffff; ++i)
+    {
+      bool expok = (i <= 0x7f) || (i >= 0xdc80 && i <= 0xdcff);
+      size_t expsz = expok ? 1 : (size_t) -1;
+      unsigned char expob = expok ? (i & 0xff) : (unsigned char) -1;
+
+      unsigned char ob = -1;
+      mbstate_t ctx = {};
+      size_t sz = wcrtomb ((char *) &ob, i, &ctx);
+      if (sz != expsz)
+        {
+          printf ("wcrtomb(%x) width in locale %s wrong "
+                  "(is %zd, should be %zd)\n", i, locname, sz, expsz);
+          result = 1;
+        }
+      if (ob != expob)
+        {
+          printf ("wcrtomb(%x) value in locale %s wrong "
+                  "(is %hhx, should be %hhx)\n", i, locname, ob, expob);
+          result = 1;
+        }
+    }
+
   /* Test the new locale mechanisms.  */
   loc = newlocale (LC_ALL_MASK, locname, NULL);
   if (loc == NULL)
diff --git a/localedata/charmaps/POSIX b/localedata/charmaps/POSIX
new file mode 100644
index 0000000000..69bdf6b485
--- /dev/null
+++ b/localedata/charmaps/POSIX
@@ -0,0 +1,136 @@
+ POSIX
+ %
+ /
+% source: cf. localedata/locales/POSIX, LC_COLLATE
+
+CHARMAP
+ /x00 NULL (NUL)
+ /x01 START OF HEADING (SOH)
+ /x02 START OF TEXT (STX)
+ /x03 END OF TEXT (ETX)
+ /x04 END OF TRANSMISSION (EOT)
+ /x05 ENQUIRY (ENQ)
+ /x06 ACKNOWLEDGE (ACK)
+ /x07 BELL (BEL)
+ /x08 BACKSPACE (BS)
+ /x09 CHARACTER TABULATION (HT)
+ /x0a LINE FEED (LF)
+ /x0b LINE TABULATION (VT)
+ /x0c FORM FEED (FF)
+ /x0d CARRIAGE RETURN (CR)
+ /x0e SHIFT OUT (SO)
+ /x0f SHIFT IN (SI)
+ /x10 DATALINK ESCAPE (DLE)
+ /x11 DEVICE CONTROL ONE (DC1)
+ /x12 DEVICE CONTROL TWO (DC2)
+ /x13 DEVICE CONTROL THREE (DC3)
+ /x14 DEVICE CONTROL FOUR (DC4)
+ /x15 NEGATIVE ACKNOWLEDGE (NAK)
+ /x16 SYNCHRONOUS IDLE (SYN)
+ /x17 END OF TRANSMISSION BLOCK (ETB)
+ /x18 CANCEL (CAN)
+ /x19 END OF MEDIUM (EM)
+ /x1a SUBSTITUTE (SUB)
+ /x1b ESCAPE (ESC)
+ /x1c FILE SEPARATOR (IS4)
+ /x1d GROUP SEPARATOR (IS3)
+ /x1e RECORD SEPARATOR (IS2)
+ /x1f UNIT SEPARATOR (IS1)
+ /x20 SPACE
+ /x21 EXCLAMATION MARK
+ /x22 QUOTATION MARK
+ /x23 NUMBER SIGN
+ /x24 DOLLAR SIGN
+ /x25 PERCENT SIGN
+ /x26 AMPERSAND
+ /x27 APOSTROPHE
+ /x28 LEFT PARENTHESIS
+ /x29 RIGHT PARENTHESIS
+ /x2a ASTERISK
+ /x2b PLUS SIGN
+ /x2c COMMA
+ /x2d HYPHEN-MINUS
+ /x2e FULL STOP
+ /x2f SOLIDUS
+ /x30 DIGIT ZERO
+ /x31 DIGIT ONE
+ /x32 DIGIT TWO
+ /x33 DIGIT THREE
+ /x34 DIGIT FOUR
+ /x35 DIGIT FIVE
+ /x36 DIGIT SIX
+ /x37 DIGIT SEVEN
+ /x38 DIGIT EIGHT
+ /x39 DIGIT NINE
+ /x3a COLON
+ /x3b SEMICOLON
+ /x3c LESS-THAN SIGN
+ /x3d EQUALS SIGN
+ /x3e GREATER-THAN SIGN
+ /x3f QUESTION MARK
+ /x40 COMMERCIAL AT
+ /x41 LATIN CAPITAL LETTER A
+ /x42 LATIN CAPITAL LETTER B
+ /x43 LATIN CAPITAL LETTER C
+ /x44 LATIN CAPITAL LETTER D
+ /x45 LATIN CAPITAL LETTER E
+ /x46 LATIN CAPITAL LETTER F
+ /x47 LATIN CAPITAL LETTER G
+ /x48 LATIN CAPITAL LETTER H
+ /x49 LATIN CAPITAL LETTER I
+ /x4a LATIN CAPITAL LETTER J
+ /x4b LATIN CAPITAL LETTER K
+ /x4c LATIN CAPITAL LETTER L
+ /x4d LATIN CAPITAL LETTER M
+ /x4e LATIN CAPITAL LETTER N
+ /x4f LATIN CAPITAL LETTER O
+ /x50 LATIN CAPITAL LETTER P
+ /x51 LATIN CAPITAL LETTER Q
+ /x52 LATIN CAPITAL LETTER R
+ /x53 LATIN CAPITAL LETTER S
+ /x54 LATIN CAPITAL LETTER T
+ /x55 LATIN CAPITAL LETTER U
+ /x56 LATIN CAPITAL LETTER V
+ /x57 LATIN CAPITAL LETTER W
+ /x58 LATIN CAPITAL LETTER X
+ /x59 LATIN CAPITAL LETTER Y
+ /x5a LATIN CAPITAL LETTER Z
+ /x5b LEFT SQUARE BRACKET
+ /x5c REVERSE SOLIDUS
+ /x5d RIGHT SQUARE BRACKET
+ /x5e CIRCUMFLEX ACCENT
+ /x5f LOW LINE
+ /x60 GRAVE ACCENT
+ /x61 LATIN SMALL LETTER A
+ /x62 LATIN SMALL LETTER B
+ /x63 LATIN SMALL LETTER C
+ /x64 LATIN SMALL LETTER D
+ /x65 LATIN SMALL LETTER E
+ /x66 LATIN SMALL LETTER F
+ /x67 LATIN SMALL LETTER G
+ /x68 LATIN SMALL LETTER H
+ /x69 LATIN SMALL LETTER I
+ /x6a LATIN SMALL LETTER J
+ /x6b LATIN SMALL LETTER K
+ /x6c LATIN SMALL LETTER L
+ /x6d LATIN SMALL LETTER M
+ /x6e LATIN SMALL LETTER N
+ /x6f LATIN SMALL LETTER O
+ /x70 LATIN SMALL LETTER P
+ /x71 LATIN SMALL LETTER Q
+ /x72 LATIN SMALL LETTER R
+ /x73 LATIN SMALL LETTER S
+ /x74 LATIN SMALL LETTER T
+ /x75 LATIN SMALL LETTER U
+ /x76 LATIN SMALL LETTER V
+ /x77 LATIN SMALL LETTER W
+ /x78 LATIN SMALL LETTER X
+ /x79 LATIN SMALL LETTER Y
+ /x7a LATIN SMALL LETTER Z
+ /x7b LEFT CURLY BRACKET
+ /x7c VERTICAL LINE
+ /x7d RIGHT CURLY BRACKET
+ /x7e TILDE
+ /x7f DELETE (DEL)
+.. /x80
+END CHARMAP
diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX
index 7ec7f1c577..45f2fa0b31 100644
--- a/localedata/locales/POSIX
+++ b/localedata/locales/POSIX
@@ -97,6 +97,20 @@ END LC_CTYPE
 LC_COLLATE
 % This is the POSIX Locale definition for the LC_COLLATE category.
 % The order is the same as in the ASCII code set.
+% Values above () inserted in order, per Issue 7 TC2,
+% XBD, 7.3.2, LC_COLLATE Category in the POSIX Locale:
+% > All characters not explicitly listed here shall be inserted
+% > in the character collation order after the listed characters
+% > and shall be assigned unique primary weights. If the listed
+% > characters have ASCII encoding, the other characters shall
+% > be in ascending order according to their coded character set values
+% Since Issue 7 TC2 (XBD, 6.2 Character Encoding):
+% > The POSIX locale shall contain 256 single-byte characters [...]
+% (cf. bug 663, 674).
+% this is in contrast to previous issues, which limited the POSIX
+% locale to the Portable Character Set (7-bit ASCII).
+% We use the same part of the Low Surrogate Area as Python
+% to contain these, yielding [U+DC80, U+DCFF]
 order_start forward
@@ -226,7 +240,134 @@ order_start forward
-UNDEFINED
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 order_end
 %
 END LC_COLLATE
diff --git a/localedata/tst-c-utf8-consistency.c b/localedata/tst-c-utf8-consistency.c
index 1625e4dd0b..bd2f56834c 100644
--- a/localedata/tst-c-utf8-consistency.c
+++ b/localedata/tst-c-utf8-consistency.c
@@ -253,7 +253,7 @@ one_pass (void)
   TEST_COMPARE_STRING_WIDE (wstr (_NL_W_DATE_FMT), wstr_utf8 (_NL_W_DATE_FMT));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_TIME_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_TIME_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_TIME_CODESET), "UTF-8");
 
   TEST_COMPARE_STRING (str (ALTMON_1), str_utf8 (ALTMON_1));
@@ -321,11 +321,11 @@ one_pass (void)
                             wstr_utf8 (_NL_WABALTMON_12));
 
   /* LC_COLLATE.  Mostly untested, only expected differences.  */
-  TEST_COMPARE_STRING (str (_NL_COLLATE_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_COLLATE_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_COLLATE_CODESET), "UTF-8");
 
   /* LC_CTYPE.  Mostly untested, only expected differences.  */
-  TEST_COMPARE_STRING (str (CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (CODESET), "UTF-8");
 
   /* LC_MONETARY.  */
@@ -401,7 +401,7 @@ one_pass (void)
   TEST_COMPARE (word (_NL_MONETARY_THOUSANDS_SEP_WC),
                 word_utf8 (_NL_MONETARY_THOUSANDS_SEP_WC));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_MONETARY_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_MONETARY_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_MONETARY_CODESET), "UTF-8");
 
   /* LC_NUMERIC.  */
@@ -416,7 +416,7 @@ one_pass (void)
   TEST_COMPARE (word (_NL_NUMERIC_THOUSANDS_SEP_WC),
                 word_utf8 (_NL_NUMERIC_THOUSANDS_SEP_WC));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_NUMERIC_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_NUMERIC_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_NUMERIC_CODESET), "UTF-8");
 
   /* LC_MESSAGES.  */
@@ -426,7 +426,7 @@ one_pass (void)
   TEST_COMPARE_STRING (str (YESSTR), str_utf8 (YESSTR));
   TEST_COMPARE_STRING (str (NOSTR), str_utf8 (NOSTR));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_MESSAGES_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_MESSAGES_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_MESSAGES_CODESET), "UTF-8");
 
   /* LC_PAPER.  */
@@ -434,7 +434,7 @@ one_pass (void)
   TEST_COMPARE (word (_NL_PAPER_HEIGHT), word_utf8 (_NL_PAPER_HEIGHT));
   TEST_COMPARE (word (_NL_PAPER_WIDTH), word_utf8 (_NL_PAPER_WIDTH));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_PAPER_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_PAPER_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_PAPER_CODESET), "UTF-8");
 
   /* LC_NAME.  */
@@ -452,7 +452,7 @@ one_pass (void)
   TEST_COMPARE_STRING (str (_NL_NAME_NAME_MS),
                        str_utf8 (_NL_NAME_NAME_MS));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_NAME_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_NAME_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_NAME_CODESET), "UTF-8");
 
   /* LC_ADDRESS.  */
@@ -482,7 +482,7 @@ one_pass (void)
   TEST_COMPARE_STRING (str (_NL_ADDRESS_LANG_LIB),
                        str_utf8 (_NL_ADDRESS_LANG_LIB));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_ADDRESS_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_ADDRESS_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_ADDRESS_CODESET), "UTF-8");
 
   /* LC_TELEPHONE.  */
@@ -496,7 +496,7 @@ one_pass (void)
   TEST_COMPARE_STRING (str (_NL_TELEPHONE_INT_PREFIX),
                        str_utf8 (_NL_TELEPHONE_INT_PREFIX));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_TELEPHONE_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_TELEPHONE_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_TELEPHONE_CODESET), "UTF-8");
 
   /* LC_MEASUREMENT.  */
@@ -504,7 +504,7 @@ one_pass (void)
   TEST_COMPARE (byte (_NL_MEASUREMENT_MEASUREMENT),
                 byte_utf8 (_NL_MEASUREMENT_MEASUREMENT));
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_MEASUREMENT_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_MEASUREMENT_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_MEASUREMENT_CODESET), "UTF-8");
 
   /* LC_IDENTIFICATION is skipped since C.UTF-8 is distinct from C.  */
@@ -512,7 +512,7 @@ one_pass (void)
   /* _NL_IDENTIFICATION_CATEGORY cannot be tested because it is a string
      array.  */
   /* Expected difference.  */
-  TEST_COMPARE_STRING (str (_NL_IDENTIFICATION_CODESET), "ANSI_X3.4-1968");
+  TEST_COMPARE_STRING (str (_NL_IDENTIFICATION_CODESET), "POSIX");
   TEST_COMPARE_STRING (str_utf8 (_NL_IDENTIFICATION_CODESET), "UTF-8");
 }
 
diff --git a/stdio-common/Makefile b/stdio-common/Makefile
index 8871ec7668..bbfd793f29 100644
--- a/stdio-common/Makefile
+++ b/stdio-common/Makefile
@@ -360,6 +360,7 @@ $(objpfx)test-vfprintf.out: $(gen-locales)
 $(objpfx)tst-grouping.out: $(gen-locales)
 $(objpfx)tst-grouping2.out: $(gen-locales)
 $(objpfx)tst-grouping_iterator.out: $(gen-locales)
+$(objpfx)tst-printf-bz25691-mem.out: $(gen-locales)
 $(objpfx)tst-sprintf.out: $(gen-locales)
 $(objpfx)tst-sscanf.out: $(gen-locales)
 $(objpfx)tst-swprintf.out: $(gen-locales)
diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c
index 44e9ea7d9d..c887b9962f 100644
--- a/stdio-common/tst-printf-bz25691.c
+++ b/stdio-common/tst-printf-bz25691.c
@@ -30,6 +30,8 @@
 static int
 do_test (void)
 {
+  setlocale(LC_CTYPE, "C.UTF-8");
+
   mtrace ();
 
   /* For 's' conversion specifier with 'l' modifier the array must be
diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c
index 7b338b6775..86666e8231 100644
--- a/wcsmbs/wcsmbsload.c
+++ b/wcsmbs/wcsmbsload.c
@@ -33,10 +33,10 @@ static const struct __gconv_step to_wc =
   .__shlib_handle = NULL,
   .__modname = NULL,
   .__counter = INT_MAX,
-  .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
+  .__from_name = (char *) "POSIX",
   .__to_name = (char *) "INTERNAL",
-  .__fct = __gconv_transform_ascii_internal,
-  .__btowc_fct = __gconv_btwoc_ascii,
+  .__fct = __gconv_transform_posix_internal,
+  .__btowc_fct = __gconv_btwoc_posix,
   .__init_fct = NULL,
   .__end_fct = NULL,
   .__min_needed_from = 1,
@@ -53,8 +53,8 @@ static const struct __gconv_step to_mb =
   .__modname = NULL,
   .__counter = INT_MAX,
   .__from_name = (char *) "INTERNAL",
-  .__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
-  .__fct = __gconv_transform_internal_ascii,
+  .__to_name = (char *) "POSIX",
+  .__fct = __gconv_transform_internal_posix,
   .__btowc_fct = NULL,
   .__init_fct = NULL,
   .__end_fct = NULL,
@@ -67,7 +67,9 @@ static const struct __gconv_step to_mb =
 };
 
 
-/* For the default locale we only have to handle ANSI_X3.4-1968.  */
+/* The default/"POSIX"/"C" locale is an 8-bit-clean mapping
+   with ANSI_X3.4-1968 in the first 128 characters;
+   we lift the remaining bytes by 0xDC00.  */
 const struct gconv_fcts __wcsmbs_gconv_fcts_c =
 {
   .towc = (struct __gconv_step *) &to_wc,