Message ID: 20180703170700.9306-1-krisman@collabora.co.uk
Series: EXT4 encoding support
On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote:
> Since not every NLS table supports normalization operations, we limit
> which encodings can be used by an ext4 volume. Right now, ascii and
> utf8n are supported, utf8n being a new version of the utf8 charset, but
> with normalization support using the SGI patches, which are part of this
> patchset.

Why do we need to distinguish between utf8n and utf8?  Why can't we
just add normalization to the existing utf8 character set?  What would
break?

Also, do we *have* to support only encodings that have normalization?
It's pointless w/o case-folding support (which is not in this patch
series), but what would happen if we supported case-folding w/o
normalization?

	- Ted
"Theodore Y. Ts'o" <tytso@mit.edu> writes: > On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote: >> Since not every NLS tables support normalization operations, we limit >> which encodings can be used by an ext4 volume. Right now, ascii and >> utf8n are supported, utf8n being a new version of the utf8 charset, but >> with normalization support using the SGI patches, which are part of this >> patchset. > > Why do we need to have to distinguish between utf8n vs utf8? Why > can't we just add normalization to existing utf8 character set? What > would break? The reason I made it separate charsets is that if we ever decide to support normalization on filesystems that already implement some support for uftf8 already (fat, for instance), we don't want to change the behavior of existing disks, where strings wouldn't be normalized, since that would be an ABI breakage. By separating the non-normalized and normalized version of the charset, we let the user decide, or at least the superblock inform whether the disk wants normalization or not by setting the right charset. > > Also, do we *have* to support only encodings that have normalization? > It's pointless w/o case-folding support (which is not in this patch > series), but what would happen if we supported case-folding w/o > normalization? We could fallback the normalization operation to the string identity, which would allow us to support any charset available in NLS. My concern with that is if we someday add normalization to any other charset, we'd breaking the compatibility of fs that had it, similarly to the reason I implemented utf8n separately from utf8. Also there is the small issue of assigning magic numbers for the encodings in the superblock, but this is easy to fix. If, for some reason, this is not a problem in this case, I can change it in the next iteration, to merge utf8n and utf8, and also allow other charsets.
On Thu, Jul 12, 2018 at 01:16:15PM -0400, Gabriel Krisman Bertazi wrote:
> "Theodore Y. Ts'o" <tytso@mit.edu> writes:
>
> > On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote:
> >> Since not every NLS table supports normalization operations, we limit
> >> which encodings can be used by an ext4 volume. Right now, ascii and
> >> utf8n are supported, utf8n being a new version of the utf8 charset, but
> >> with normalization support using the SGI patches, which are part of this
> >> patchset.
> >
> > Why do we need to distinguish between utf8n and utf8?  Why can't we
> > just add normalization to the existing utf8 character set?  What would
> > break?
>
> The reason I made them separate charsets is that if we ever decide to
> support normalization on filesystems that already implement some
> support for utf8 (fat, for instance), we don't want to change the
> behavior of existing disks, where strings wouldn't be normalized, since
> that would be an ABI breakage.  By separating the non-normalized and
> normalized versions of the charset, we let the user decide, or at least
> let the superblock inform whether the disk wants normalization or not,
> by setting the right charset.

Hmm, so there's a philosophical question hiding here, I think.  Does a
file system which is encoding aware have to do normalization?  Or more
generally, what does it *mean* for a file system to be encoding aware?

These are all things that a file system could do, given that it is
encoding aware and is declared to be using a particular encoding:

A) Filenames that are "invalid" with respect to an encoding are rejected
B) Filenames are normalized before they are stored in the directory
C) Filenames are compared in a normalization-insensitive manner
D) Filenames are forced to a case before they are stored in a directory
E) Filenames are compared in a case-insensitive manner

Some of these behaviors are orthogonal; that is, you could do A, or you
could do C, or you could do both, or you could do neither.  And some of
these behaviors are format-dependent (e.g., you can't change an encoding
without running some kind of off-line fsck-like program across the
entire file system); and some of them are not format-dependent (and so
could be overridden by a mount option).

So maybe what we need to talk about is having a feature called
EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in the
superblock.  One is the encoding identifier (and 8 or 16 bits is
probably *plenty*), and the other is the "encoding flags" field.

Some of these flags might specify a property of the encoding --- e.g.,
the file system supports normalization- and/or case-insensitive lookups
in an efficient way by normalizing the string before calculating the
dir_index hash.  Some of these might specify the default behavior
(e.g., case-insensitive or normalization-insensitive file lookups) if
not overridden by a mount option.

This assumes that normalization and case sensitivity are completely
orthogonal.

The other thing is there seems to be some debate (and Apple isn't even
consistent over time) over what kind of normalization is considered
"best" or "correct", e.g., NFD, NFC, NFKD, NFKC.  And if you want to
export the file system over APFS, it might make a difference which one
you use.  (This is usually the point where some people will assert that
teaching everyone in the world English really *would* be easier than
supporting full I18N. :-)  Is this something we can or should consider
when deciding what we want to support in Linux long-term?
> If, for some reason, this is not a problem in this case, I can change it
> in the next iteration, to merge utf8n and utf8, and also allow other
> charsets.

... and what I'm really asking is: do we really want to specify whether
or not normalization is a Thing as a property of the encoding, or as a
property of the file system (or object, or document) that uses that
particular encoding?

	- Ted
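[A minimal sketch of the two superblock fields and an "encoding flags"
bitmap along the lines Ted describes; every name and bit value here is
hypothetical, not an agreed on-disk format:]

	#include <stdint.h>

	/* Hypothetical feature bit; the value is made up for illustration. */
	#define EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING		0x10000

	/* Hypothetical encoding identifiers (8 or 16 bits is plenty). */
	#define EXT4_ENC_ASCII					1
	#define EXT4_ENC_UTF8					2

	/* Hypothetical "encoding flags", roughly matching behaviors A-E above. */
	#define EXT4_ENC_FL_REJECT_INVALID		0x0001	/* A: reject ill-formed names */
	#define EXT4_ENC_FL_NORMALIZED_HASH		0x0002	/* dirhash over normalized name */
	#define EXT4_ENC_FL_DEFAULT_NORM_INSENSITIVE	0x0004	/* C by default */
	#define EXT4_ENC_FL_DEFAULT_CASE_INSENSITIVE	0x0008	/* E by default */

	/* The two fields the INCOMPAT feature would enable in the superblock. */
	struct ext4_sb_encoding {
		uint16_t s_encoding;		/* encoding identifier */
		uint16_t s_encoding_flags;	/* behavior flags above */
	};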
"Theodore Y. Ts'o" <tytso@mit.edu> writes: > So maybe we need to talk about is having a feature called > EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in > the superblock. One is the encoding identifier (and 8 or 16 bits is > probably *plenty*), and the other is the "encoding flags" field. The current patchset makes encoding an INCOMPAT feature, but I'm using 32 bits for the encoding identifier. I will change it to 16 bits in the next iteration of the patch. > Some of these flags might specify an encoding --- e.g., the file > system supports normalization- and/or case- insensitive lookups in an > efficient way by normalizing the string before calculating the > dir_index hash. Some of these might specify the default behavior > (e.g., case-insensitive or normalization-insensitive) file lookups if > not overridden by a mount option. I like the idea of encoding flags for selecting the default for case/normalization -sensitiveness. But I'm not really sure about a flag stating support for normalized hashes. It could be made redundant with the feature/casefold flag itself, if we make tune2fs or similar rehash the disk when enabling/disabling the encoding feature flag. Feature flag is set -> Hash(normalization(x)) Feature flag and parent inode casefold flag are set -> Hash(casefold(x)) The casefold superblock flag would state whether the casefold inode flags defaults to true or false. > This assumes that normalization and case sensitivity are completely > orthogonal. I'm thinking of casefolding as a special case of the normalization problem, just because its semantics are interesting for users. In fact, it could be seen as just a different normalization function, from the implementation point of view. So, it is not completely orthogonal per-se, but it also deserves some special stuff attention be more useful, like being per-directory, and to carrying its on activation flags. > The other thing is there seems to be some debate (and Apple isn't even > consistent over time) over what kind of normalization is considered > "best" or "correct". e.g., NFD, NFC, NFKD, NFKC. And if you want to > export the file system over APFS, it might make a difference which one > you use. (This is usually the point where some people will assert > that teaching everyone in the world English really *would* be easier > than supporting full I18N. :-) Is this something we can or should > consider when deciding what we want to support in Linux long-term? Since the implementation is normalization-preserving on-disk, isn't this something that can be changed in the future if it is ever needed? Provided we can rehash the dentries if we need to change the normalization, a flag in the superblock, stating what normalization method is used, should suffice if we ever want to support other normalization methods. I have to say, It is not in my plans to support anything other than NFKD. :) > ... and what I'm really asking is do we really want to be specifying > whether or not normalization is a Thing as a property of the encoding, > or a property of the file system (or object, or document) that uses > that particular encoding? I see normalization as an inherent property of the encoding, since, for the user equivalent strings should mean the same thing in the natural language. But I see the point of filesystems wanting to ignore normalization. I am pending towards the permissive route, where this can be enabled/disabled when loading a NLS charset table. 
This way we can merge utf8 and utf8n and satisfy the normalization
case, while keeping compatibility with older users.  What do you think?
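[A minimal sketch of the hash-selection rule Gabriel describes above,
i.e. Hash(normalization(x)) when the feature flag is set and
Hash(casefold(x)) when the parent directory's casefold flag is also
set; the helper names are hypothetical stand-ins for the real dir_index
hash and NLS routines:]

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/* Assumed helpers standing in for the real dx hash and NLS routines. */
	uint32_t dx_hash(const unsigned char *name, size_t len);
	size_t charset_normalize(const unsigned char *in, size_t len,
				 unsigned char *out, size_t outlen);
	size_t charset_casefold(const unsigned char *in, size_t len,
				 unsigned char *out, size_t outlen);

	/*
	 * Pick which form of the name feeds the dir_index hash:
	 *   - encoding feature off:              hash the raw name
	 *   - encoding feature on:               hash the normalized name
	 *   - parent dir casefold flag also on:  hash the casefolded name
	 */
	static uint32_t dirent_name_hash(const unsigned char *name, size_t len,
					 bool encoding_enabled, bool dir_casefold)
	{
		unsigned char buf[255];	/* EXT4_NAME_LEN */
		size_t n;

		if (!encoding_enabled)
			return dx_hash(name, len);

		n = dir_casefold ? charset_casefold(name, len, buf, sizeof(buf))
				 : charset_normalize(name, len, buf, sizeof(buf));

		return dx_hash(buf, n);
	}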