Message ID | 20210428130033.3196848-5-carlos@redhat.com |
---|---|
State | New |
Series | Add new C.UTF-8 locale (Bug 17318) |
* Carlos O'Donell:

> We add a new C.UTF-8 locale. This locale is not builtin to glibc, but
> is provided as a distinct locale. The locale provides full support
> for UTF-8 and this includes full code point sorting via collation
> (excludes surrogates). Unfortunately given the present implementation
> in glibc this results in 28MiB of LC_COLLATE data for all possible
> Unicode code points. Future improvements may reduce this size. Such
> improvements likely require a shortcut for the collation data that
> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>
> The new locale is NOT added to SUPPORTED. Minimal test data for
> specific code points (minus those not supported by collate-test) is
> provided in C.UTF-8.in, and this verifies code point sorting is
> working reasonably across the range.
>
> The next step is to reduce LC_COLLATE to a manageable size before we
> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
> twice) so we don't enable full testing of all code points until we can
> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
> test data passes cleanly.

Can you compare this locale with what is in Fedora and Debian, for the
non-collation/CTYPE aspects?

Are there other distributions which ship a downstream C.UTF-8 locale?

Thanks,
Florian

On 4/29/21 10:13 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> We add a new C.UTF-8 locale. This locale is not builtin to glibc, but
>> is provided as a distinct locale. The locale provides full support
>> for UTF-8 and this includes full code point sorting via collation
>> (excludes surrogates). Unfortunately given the present implementation
>> in glibc this results in 28MiB of LC_COLLATE data for all possible
>> Unicode code points. Future improvements may reduce this size. Such
>> improvements likely require a shortcut for the collation data that
>> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>>
>> The new locale is NOT added to SUPPORTED. Minimal test data for
>> specific code points (minus those not supported by collate-test) is
>> provided in C.UTF-8.in, and this verifies code point sorting is
>> working reasonably across the range.
>>
>> The next step is to reduce LC_COLLATE to a manageable size before we
>> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
>> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
>> twice) so we don't enable full testing of all code points until we can
>> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
>> test data passes cleanly.
>
> Can you compare this locale with what is in Fedora and Debian, for the
> non-collation/CTYPE aspects?

Oh, doing this review in more detail for you found a potential defect.
Thank you for encouraging a more detailed review.

I see that C has the first work day as Monday, but in C.UTF-8 we have
switched to Sunday, possibly by accident, and my initial review didn't
catch this.

I'll spin a v5 which is also going to be smaller after this patch:
https://sourceware.org/pipermail/libc-alpha/2021-April/125595.html.

Debian (sid, 2.31-11) vs Upstream:

- LC_IDENTIFICATION, contains old date, maintainer @debian address etc.
  - No substantive differences.

- LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
  - Upstream C.UTF-8 includes no transliteration, all characters pass
    through because UTF-8 supports all such characters.

- LC_COLLATE, split ranges differently, *includes* surrogates, uses UNDEFINED correctly.
  - Upstream C.UTF-8 *excludes* surrogates, but otherwise covers same set.

- LC_MONETARY, copy "POSIX"
  - Upstream C.UTF-8 explicitly defines fields, difference in 'negative_sign'
    where upstream will use "-" and POSIX uses "". This aligns with existing
    builtin C locale.

- LC_NUMERIC, copy "POSIX"
  - Upstream C.UTF-8 explicitly defines fields, no difference from POSIX.

- LC_TIME, first_workday is 2 (otherwise the same)
  - Upstream C.UTF-8 set first_workday to 1, this is a bug in my patch.

- LC_MESSAGES, only defines yesexpr and noexpr.
  - Upstream C.UTF-8 defines yesexpr, noexpr, yesstr and nostr. Superset of data.

- LC_PAPER, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.

- LC_NAME, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.

- LC_ADDRESS, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.

- LC_TELEPHONE, defines tel_int_fmt.
  - Upstream C.UTF-8 explicitly defines tel_int_fmt, no differences.

- LC_MEASUREMENT, copy "i18n"
  - Upstream C.UTF-8 explicitly defines measurement, no differences.

> Are there other distributions which ship a downstream C.UTF-8 locale?

Yes, Gentoo.

I spoke to Andreas Huettel from the gentoo-toolchain team and they are using
Mike Fabian's original C.UTF-8 which is harmonized and identical (including
the first_workday bug) to what I'm proposing.

I think it would be safe for Debian, Ubuntu, Gentoo, Fedora, CentOS Stream,
and RHEL to switch to the new C.UTF-8 locale from upstream.

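A comparison like the one above can be partly automated for installed locales. The following is a minimal sketch, assuming Python's standard `locale` module; the helper names `numeric_monetary_fields` and `diff_fields` are hypothetical, and `localeconv()` only exposes the LC_NUMERIC and LC_MONETARY fields, so the other categories still need to be compared by hand or with other tools.

```python
import locale


def numeric_monetary_fields(name):
    """Return localeconv() for the named locale, or None if it is not installed."""
    saved = locale.setlocale(locale.LC_ALL)  # remember the current setting
    try:
        locale.setlocale(locale.LC_ALL, name)
        return locale.localeconv()
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # always restore


def diff_fields(a, b):
    """Map each key whose value differs to its (a, b) value pair."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


if __name__ == "__main__":
    builtin = numeric_monetary_fields("C")
    utf8 = numeric_monetary_fields("C.UTF-8")
    if utf8 is None:
        print("C.UTF-8 is not installed on this system")
    else:
        # An empty dict means the two locales agree on these fields.
        print(diff_fields(builtin, utf8))
```

What the script prints depends on which C.UTF-8 variant (distribution or upstream) is installed, so no particular result is claimed here.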
On 4/29/21 4:05 PM, Carlos O'Donell wrote:
> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>   - Upstream C.UTF-8 includes no transliteration, all characters pass
>     through because UTF-8 supports all such characters.

It turns out that this is related to bug 26984.

I was wrong too, the C locale has a builtin set of ~1600 transliterations that
it uses internally (I even reviewed a patch for that you committed).
I had completely forgotten about this internal detail.

This transliteration affects converters' ability to use //TRANSLIT, and so I
think we should include all the neutral transliterations e.g.

translit_start
include "translit_neutral";""
translit_end

This makes things *better* with respect to harmonization with Debian/Ubuntu.

Thoughts?

In summary:
- POSIX says nothing about transliteration.
- C/POSIX already includes a partial set of ~1600 translit entries, and they
  are largely incomplete. It would be nice to harmonize them with the proper
  translit_neutral set.
- C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
  for conversions from UTF-8 to other charmaps. This would be a superset of
  those offered by C/POSIX.
- Fixing C/POSIX is another issue.

* Carlos O'Donell:

> On 4/29/21 4:05 PM, Carlos O'Donell wrote:
>> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>>   - Upstream C.UTF-8 includes no transliteration, all characters pass
>>     through because UTF-8 supports all such characters.
>
> It turns out that this is related to bug 26984.
>
> I was wrong too, the C locale has a builtin set of ~1600 transliterations that
> it uses internally (I even reviewed a patch for that you committed).
> I had completely forgotten about this internal detail.
>
> This transliteration affects converters' ability to use //TRANSLIT, and so I
> think we should include all the neutral transliterations e.g.
>
> translit_start
> include "translit_neutral";""
> translit_end
>
> This makes things *better* with respect to harmonization with Debian/Ubuntu.
>
> Thoughts?
>
> In summary:
> - POSIX says nothing about transliteration.
> - C/POSIX already includes a partial set of ~1600 translit entries, and they
>   are largely incomplete. It would be nice to harmonize them with the proper
>   translit_neutral set.
> - C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
>   for conversions from UTF-8 to other charmaps. This would be a superset of
>   those offered by C/POSIX.
> - Fixing C/POSIX is another issue.

I'm in favor of including those transliterations.

Thanks,
Florian

On 4/30/21 2:20 PM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> On 4/29/21 4:05 PM, Carlos O'Donell wrote:
>>> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>>>   - Upstream C.UTF-8 includes no transliteration, all characters pass
>>>     through because UTF-8 supports all such characters.
>>
>> It turns out that this is related to bug 26984.
>>
>> I was wrong too, the C locale has a builtin set of ~1600 transliterations that
>> it uses internally (I even reviewed a patch for that you committed).
>> I had completely forgotten about this internal detail.
>>
>> This transliteration affects converters' ability to use //TRANSLIT, and so I
>> think we should include all the neutral transliterations e.g.
>>
>> translit_start
>> include "translit_neutral";""
>> translit_end
>>
>> This makes things *better* with respect to harmonization with Debian/Ubuntu.
>>
>> Thoughts?
>>
>> In summary:
>> - POSIX says nothing about transliteration.
>> - C/POSIX already includes a partial set of ~1600 translit entries, and they
>>   are largely incomplete. It would be nice to harmonize them with the proper
>>   translit_neutral set.
>> - C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
>>   for conversions from UTF-8 to other charmaps. This would be a superset of
>>   those offered by C/POSIX.
>> - Fixing C/POSIX is another issue.
>
> I'm in favor of including those transliterations.

Thanks. I'll spin a v5, test and repost. I need to look at the size impact
of the additional transliterations.

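For readers unfamiliar with what a transliteration rule does during charset conversion, here is a rough illustration in Python. It is not glibc's iconv implementation: the `SAMPLE_RULES` table is hypothetical and is not copied from the translit_neutral file, the function name `convert_with_translit` is invented for this sketch, and the "?" fallback for characters without a rule is an assumption of the sketch.

```python
# Hypothetical sample rules; NOT copied from glibc's translit_neutral file.
SAMPLE_RULES = {
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
}


def convert_with_translit(text, charset, rules=SAMPLE_RULES, missing="?"):
    """Encode text to charset, substituting via rules when a character does not fit."""
    out = bytearray()
    for ch in text:
        try:
            # Characters the target charset can represent pass through unchanged.
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Otherwise look for a rule; fall back to the `missing` marker.
            out += rules.get(ch, missing).encode(charset)
    return bytes(out)


if __name__ == "__main__":
    print(convert_with_translit("\u201cC.UTF-8\u201d\u2026", "ascii"))
```

The point of the sketch is only that a larger rule table lets more characters survive a conversion to a smaller charset, which is why the size of the included rule set matters for //TRANSLIT users.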
diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
new file mode 100644
index 0000000000..b8764a4e04
--- /dev/null
+++ b/localedata/C.UTF-8.in
@@ -0,0 +1,156 @@
+ ; <U1>
+ ; <U2>
+ ; <U3>
+ ; <U4>
+ ; <U5>
+ ; <U6>
+ ; <U7>
+ ; <U8>
+ ; <UE>
+ ; <UF>
+ ; <U10>
+ ; <U11>
+ ; <U12>
+ ; <U13>
+ ; <U14>
+ ; <U15>
+ ; <U16>
+ ; <U17>
+ ; <U18>
+ ; <U19>
+ ; <U1A>
+ ; <U1B>
+ ; <U1C>
+ ; <U1D>
+ ; <U1E>
+ ; <U1F>
+! ; <U21>
+" ; <U22>
+# ; <U23>
+$ ; <U24>
+% ; <U25>
+& ; <U26>
+' ; <U27>
+) ; <U29>
+* ; <U2A>
++ ; <U2B>
+, ; <U2C>
+- ; <U2D>
+. ; <U2E>
+/ ; <U2F>
+0 ; <U30>
+1 ; <U31>
+2 ; <U32>
+3 ; <U33>
+4 ; <U34>
+5 ; <U35>
+6 ; <U36>
+7 ; <U37>
+8 ; <U38>
+9 ; <U39>
+< ; <U3C>
+= ; <U3D>
+> ; <U3E>
+? ; <U3F>
+@ ; <U40>
+A ; <U41>
+B ; <U42>
+C ; <U43>
+D ; <U44>
+E ; <U45>
+F ; <U46>
+G ; <U47>
+H ; <U48>
+I ; <U49>
+J ; <U4A>
+K ; <U4B>
+L ; <U4C>
+M ; <U4D>
+N ; <U4E>
+O ; <U4F>
+P ; <U50>
+Q ; <U51>
+R ; <U52>
+S ; <U53>
+T ; <U54>
+U ; <U55>
+V ; <U56>
+W ; <U57>
+X ; <U58>
+Y ; <U59>
+Z ; <U5A>
+[ ; <U5B>
+\ ; <U5C>
+] ; <U5D>
+^ ; <U5E>
+_ ; <U5F>
+` ; <U60>
+a ; <U61>
+b ; <U62>
+c ; <U63>
+d ; <U64>
+e ; <U65>
+f ; <U66>
+g ; <U67>
+h ; <U68>
+i ; <U69>
+j ; <U6A>
+k ; <U6B>
+l ; <U6C>
+m ; <U6D>
+n ; <U6E>
+o ; <U6F>
+p ; <U70>
+q ; <U71>
+r ; <U72>
+s ; <U73>
+t ; <U74>
+u ; <U75>
+v ; <U76>
+w ; <U77>
+x ; <U78>
+y ; <U79>
+z ; <U7A>
+{ ; <U7B>
+| ; <U7C>
+} ; <U7D>
+~ ; <U7E>
+ ; <U7F>
+ ; <U80>
+ÿ ; <UFF>
+Ā ; <U100>
+ ; <UFFF>
+က ; <U1000>
+ ; <UFFFF>
+𐀀 ; <U10000>
+ ; <U1FFFF>
+𠀀 ; <U20000>
+ ; <U2FFFF>
+𰀀 ; <U30000>
+ ; <U3FFFE>
+ ; <U40000>
+ ; <U4FFFF>
+ ; <U50000>
+ ; <U5FFFF>
+ ; <U60000>
+ ; <U6FFFF>
+ ; <U70000>
+ ; <U7FFFF>
+ ; <U80000>
+ ; <U8FFFF>
+ ; <U90000>
+ ; <U9FFFF>
+ ; <UA0000>
+ ; <UAFFFF>
+ ; <UB0000>
+ ; <UBFFFF>
+ ; <UC0001>
+ ; <UCFFCC>
+ ; <UD000E>
+ ; <UDFFFF>
+ ; <UE0001>
+ ; <UEFFFF>
+ ; <UF0001>
+ ; <UFFFFF>
+ ; <U100001>
+ ; <U10FFFF>
diff --git a/localedata/Makefile b/localedata/Makefile
index 14e04cd3c5..38017f2c4c 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -47,6 +47,7 @@ test-input := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
@@ -206,6 +207,7 @@ LOCALES := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
diff --git a/localedata/locales/C b/localedata/locales/C
new file mode 100644
index 0000000000..67e5bd913b
--- /dev/null
+++ b/localedata/locales/C
@@ -0,0 +1,188 @@
+escape_char /
+comment_char %
+% Locale for C locale in UTF-8
+
+LC_IDENTIFICATION
+title "C locale"
+source ""
+address ""
+contact ""
+email "bug-glibc-locales@gnu.org"
+tel ""
+fax ""
+language ""
+territory ""
+revision "2.0"
+date "2020-06-28"
+category "i18n:2012";LC_IDENTIFICATION
+category "i18n:2012";LC_CTYPE
+category "i18n:2012";LC_COLLATE
+category "i18n:2012";LC_TIME
+category "i18n:2012";LC_NUMERIC
+category "i18n:2012";LC_MONETARY
+category "i18n:2012";LC_MESSAGES
+category "i18n:2012";LC_PAPER
+category "i18n:2012";LC_NAME
+category "i18n:2012";LC_ADDRESS
+category "i18n:2012";LC_TELEPHONE
+category "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_CTYPE
+
+% Include only the i18n character type classes without any of the
+% transliteration that i18n uses by default. The C locale has no
+% transliteration and passes all characters through unchanged.
+copy "i18n_ctype"
+
+END LC_CTYPE
+
+% One rule, sort forward, for all Unicode scalar values to give
+% code point order sorting for Unicode (excludes surrogates
+% which are not in the UTF-8 character map).
+LC_COLLATE
+order_start forward
+<U00000000>
+..
+<U0000D7FF>
+% Exclude surrogates <UD800> to <UDFFF> from collation.
+<U0000E000>
+..
+<U0010FFFF>
+UNDEFINED
+order_end
+END LC_COLLATE
+
+LC_MONETARY
+
+% This is the 14652 i18n fdcc-set definition for the LC_MONETARY
+% category (except for the int_curr_symbol and currency_symbol, they are
+% empty in the 14652 i18n fdcc-set definition and also empty in
+% glibc/locale/C-monetary.c.).
+int_curr_symbol ""
+currency_symbol ""
+mon_decimal_point "."
+mon_thousands_sep ""
+mon_grouping -1
+positive_sign ""
+negative_sign "-"
+int_frac_digits -1
+frac_digits -1
+p_cs_precedes -1
+int_p_sep_by_space -1
+p_sep_by_space -1
+n_cs_precedes -1
+int_n_sep_by_space -1
+n_sep_by_space -1
+p_sign_posn -1
+n_sign_posn -1
+%
+END LC_MONETARY
+
+LC_NUMERIC
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+decimal_point "."
+thousands_sep ""
+grouping -1
+END LC_NUMERIC
+
+LC_TIME
+% This is the POSIX Locale definition for the LC_TIME category with the
+% exception that time is per ISO 8601 and 24-hour.
+%
+% Abbreviated weekday names (%a)
+abday "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
+
+% Full weekday names (%A)
+day "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
+	"Friday";"Saturday"
+
+% Abbreviated month names (%b)
+abmon "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
+	"Oct";"Nov";"Dec"
+
+% Full month names (%B)
+mon "January";"February";"March";"April";"May";"June";"July";/
+	"August";"September";"October";"November";"December"
+
+% Week description, consists of three fields:
+% 1. Number of days in a week.
+% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
+% 3. The weekday number to be contained in the first week of the year.
+%
+% ISO 8601 conforming applications should use the values 7, 19971201 (a
+% Monday), and 4 (Thursday), respectively.
+week 7;19971201;4
+first_weekday 1
+first_workday 1
+
+% Appropriate date and time representation (%c)
+d_t_fmt "%a %b %e %H:%M:%S %Y"
+
+% Appropriate date representation (%x)
+d_fmt "%m/%d/%y"
+
+% Appropriate time representation (%X)
+t_fmt "%H:%M:%S"
+
+% Appropriate AM/PM time representation (%r)
+t_fmt_ampm "%I:%M:%S %p"
+
+% Equivalent of AM/PM (%p)
+am_pm "AM";"PM"
+
+% Appropriate date representation (date(1)) "%a %b %e %H:%M:%S %Z %Y"
+date_fmt "%a %b %e %H:%M:%S %Z %Y"
+END LC_TIME
+
+LC_MESSAGES
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+yesexpr "^[yY]"
+noexpr "^[nN]"
+yesstr "Yes"
+nostr "No"
+END LC_MESSAGES
+
+LC_PAPER
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_PAPER category.
+% (A4 paper, this is also used in the built in C/POSIX
+% locale in glibc/locale/C-paper.c)
+height 297
+width 210
+END LC_PAPER
+
+LC_NAME
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_NAME category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-name.c)
+name_fmt "%p%t%g%t%m%t%f"
+END LC_NAME
+
+LC_ADDRESS
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_ADDRESS category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-address.c)
+postal_fmt "%a%N%f%N%d%N%b%N%s %h %e %r%N%C-%z %T%N%c%N"
+END LC_ADDRESS
+
+LC_TELEPHONE
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_TELEPHONE category.
+% "+%c %a %l"
+tel_int_fmt "+%c %a %l"
+% (also used in the built in C/POSIX locale in glibc/locale/C-telephone.c)
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_MEASUREMENT category.
+% (same as in the built in C/POSIX locale in glibc/locale/C-measurement.c)
+%metric
+measurement 1
+END LC_MEASUREMENT
+
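
The patch description suggests that a smaller LC_COLLATE could rely on C.UTF-8 sorting being equivalent to strcmp. The property behind that idea is that comparing UTF-8 encoded strings byte by byte gives the same order as comparing their code points, once surrogates are excluded as in the LC_COLLATE rule of the patch. The following is a small self-contained check of that property in Python; it is only an illustration with hypothetical helper names, and it does not exercise glibc's strcoll or the locale itself.

```python
import random


def is_surrogate(cp):
    """True for U+D800..U+DFFF, which have no UTF-8 encoding."""
    return 0xD800 <= cp <= 0xDFFF


def sample_scalar_values(count, seed=0):
    """Pick distinct pseudo-random Unicode scalar values (surrogates excluded)."""
    rng = random.Random(seed)
    values = set()
    while len(values) < count:
        cp = rng.randrange(0x110000)
        if not is_surrogate(cp):
            values.add(cp)
    return list(values)


def byte_order_matches_code_point_order(code_points):
    """True if sorting by UTF-8 bytes gives the same order as sorting by code point."""
    strings = [chr(cp) for cp in code_points]
    by_code_point = sorted(strings, key=ord)
    by_utf8_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_utf8_bytes


if __name__ == "__main__":
    print(byte_order_matches_code_point_order(sample_scalar_values(2000)))
```

Python compares `bytes` objects as sequences of unsigned values, which is the comparison assumed for strcmp here; strings containing U+0000 are a separate concern for strcmp because it stops at the first NUL byte.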