Message ID: 20180703170700.9306-1-krisman@collabora.co.uk
Series: EXT4 encoding support
On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote:
> Since not every NLS table supports normalization operations, we limit
> which encodings can be used by an ext4 volume. Right now, ascii and
> utf8n are supported, utf8n being a new version of the utf8 charset, but
> with normalization support using the SGI patches, which are part of this
> patchset.

Why do we need to distinguish between utf8n and utf8?  Why can't we
just add normalization to the existing utf8 character set?  What would
break?

Also, do we *have* to support only encodings that have normalization?
It's pointless w/o case-folding support (which is not in this patch
series), but what would happen if we supported case-folding w/o
normalization?

	- Ted
"Theodore Y. Ts'o" <tytso@mit.edu> writes: > On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote: >> Since not every NLS tables support normalization operations, we limit >> which encodings can be used by an ext4 volume. Right now, ascii and >> utf8n are supported, utf8n being a new version of the utf8 charset, but >> with normalization support using the SGI patches, which are part of this >> patchset. > > Why do we need to have to distinguish between utf8n vs utf8? Why > can't we just add normalization to existing utf8 character set? What > would break? The reason I made it separate charsets is that if we ever decide to support normalization on filesystems that already implement some support for uftf8 already (fat, for instance), we don't want to change the behavior of existing disks, where strings wouldn't be normalized, since that would be an ABI breakage. By separating the non-normalized and normalized version of the charset, we let the user decide, or at least the superblock inform whether the disk wants normalization or not by setting the right charset. > > Also, do we *have* to support only encodings that have normalization? > It's pointless w/o case-folding support (which is not in this patch > series), but what would happen if we supported case-folding w/o > normalization? We could fallback the normalization operation to the string identity, which would allow us to support any charset available in NLS. My concern with that is if we someday add normalization to any other charset, we'd breaking the compatibility of fs that had it, similarly to the reason I implemented utf8n separately from utf8. Also there is the small issue of assigning magic numbers for the encodings in the superblock, but this is easy to fix. If, for some reason, this is not a problem in this case, I can change it in the next iteration, to merge utf8n and utf8, and also allow other charsets.
On Thu, Jul 12, 2018 at 01:16:15PM -0400, Gabriel Krisman Bertazi wrote:
> "Theodore Y. Ts'o" <tytso@mit.edu> writes:
>
> > On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote:
> >> Since not every NLS table supports normalization operations, we limit
> >> which encodings can be used by an ext4 volume. Right now, ascii and
> >> utf8n are supported, utf8n being a new version of the utf8 charset, but
> >> with normalization support using the SGI patches, which are part of this
> >> patchset.
> >
> > Why do we need to distinguish between utf8n and utf8?  Why can't we
> > just add normalization to the existing utf8 character set?  What would
> > break?
>
> The reason I made them separate charsets is that if we ever decide to
> support normalization on filesystems that already implement some
> support for utf8 (fat, for instance), we don't want to change the
> behavior of existing disks, where strings wouldn't be normalized, since
> that would be an ABI breakage.  By separating the non-normalized and
> normalized versions of the charset, we let the user decide, or at least
> let the superblock inform whether the disk wants normalization or not,
> by setting the right charset.

Hmm, so there's a philosophical question hiding here, I think.  Does a
file system which is encoding aware have to do normalization?  Or more
generally, what does it *mean* for a file system to be encoding aware?

These are all things that a file system could do, given that it is
encoding aware and is declared to be using a particular encoding:

A) Filenames that are "invalid" with respect to an encoding are rejected
B) Filenames are normalized before they are stored in the directory
C) Filenames are compared in a normalization-insensitive manner
D) Filenames are forced to a case before they are stored in a directory
E) Filenames are compared in a case-insensitive manner

Some of these behaviors are orthogonal; that is, you could do A, or you
could do C, or you could do both, or you could do neither.  And some of
these behaviors are format-dependent (e.g., you can't change an encoding
without running some kind of off-line fsck-like program across the
entire file system); and some of them are not format-dependent (and so
could be overridden by a mount option).

So maybe what we need to talk about is having a feature called
EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in the
superblock.  One is the encoding identifier (and 8 or 16 bits is
probably *plenty*), and the other is the "encoding flags" field.

Some of these flags might specify a property of the encoding --- e.g.,
the file system supports normalization- and/or case-insensitive lookups
in an efficient way by normalizing the string before calculating the
dir_index hash.  Some of these might specify the default behavior
(e.g., case-insensitive or normalization-insensitive file lookups) if
not overridden by a mount option.

This assumes that normalization and case sensitivity are completely
orthogonal.

The other thing is there seems to be some debate (and Apple isn't even
consistent over time) over what kind of normalization is considered
"best" or "correct", e.g., NFD, NFC, NFKD, NFKC.  And if you want to
export the file system over APFS, it might make a difference which one
you use.  (This is usually the point where some people will assert that
teaching everyone in the world English really *would* be easier than
supporting full I18N. :-)  Is this something we can or should consider
when deciding what we want to support in Linux long-term?
> If, for some reason, this is not a problem in this case, I can change it
> in the next iteration, to merge utf8n and utf8, and also allow other
> charsets.

... and what I'm really asking is: do we really want to specify whether
or not normalization is a Thing as a property of the encoding, or as a
property of the file system (or object, or document) that uses that
particular encoding?

	- Ted
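[A minimal sketch of the two superblock fields and an "encoding flags"
bitmap along the lines Ted describes; every name and bit value here is
hypothetical, not an agreed on-disk format:]

	#include <stdint.h>

	/* Hypothetical feature bit; the value is made up for illustration. */
	#define EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING		0x10000

	/* Hypothetical encoding identifiers (8 or 16 bits is plenty). */
	#define EXT4_ENC_ASCII					1
	#define EXT4_ENC_UTF8					2

	/* Hypothetical "encoding flags", roughly matching behaviors A-E above. */
	#define EXT4_ENC_FL_REJECT_INVALID		0x0001	/* A: reject ill-formed names */
	#define EXT4_ENC_FL_NORMALIZED_HASH		0x0002	/* dirhash over normalized name */
	#define EXT4_ENC_FL_DEFAULT_NORM_INSENSITIVE	0x0004	/* C by default */
	#define EXT4_ENC_FL_DEFAULT_CASE_INSENSITIVE	0x0008	/* E by default */

	/* The two fields the INCOMPAT feature would enable in the superblock. */
	struct ext4_sb_encoding {
		uint16_t s_encoding;		/* encoding identifier */
		uint16_t s_encoding_flags;	/* behavior flags above */
	};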
"Theodore Y. Ts'o" <tytso@mit.edu> writes: > So maybe we need to talk about is having a feature called > EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in > the superblock. One is the encoding identifier (and 8 or 16 bits is > probably *plenty*), and the other is the "encoding flags" field. The current patchset makes encoding an INCOMPAT feature, but I'm using 32 bits for the encoding identifier. I will change it to 16 bits in the next iteration of the patch. > Some of these flags might specify an encoding --- e.g., the file > system supports normalization- and/or case- insensitive lookups in an > efficient way by normalizing the string before calculating the > dir_index hash. Some of these might specify the default behavior > (e.g., case-insensitive or normalization-insensitive) file lookups if > not overridden by a mount option. I like the idea of encoding flags for selecting the default for case/normalization -sensitiveness. But I'm not really sure about a flag stating support for normalized hashes. It could be made redundant with the feature/casefold flag itself, if we make tune2fs or similar rehash the disk when enabling/disabling the encoding feature flag. Feature flag is set -> Hash(normalization(x)) Feature flag and parent inode casefold flag are set -> Hash(casefold(x)) The casefold superblock flag would state whether the casefold inode flags defaults to true or false. > This assumes that normalization and case sensitivity are completely > orthogonal. I'm thinking of casefolding as a special case of the normalization problem, just because its semantics are interesting for users. In fact, it could be seen as just a different normalization function, from the implementation point of view. So, it is not completely orthogonal per-se, but it also deserves some special stuff attention be more useful, like being per-directory, and to carrying its on activation flags. > The other thing is there seems to be some debate (and Apple isn't even > consistent over time) over what kind of normalization is considered > "best" or "correct". e.g., NFD, NFC, NFKD, NFKC. And if you want to > export the file system over APFS, it might make a difference which one > you use. (This is usually the point where some people will assert > that teaching everyone in the world English really *would* be easier > than supporting full I18N. :-) Is this something we can or should > consider when deciding what we want to support in Linux long-term? Since the implementation is normalization-preserving on-disk, isn't this something that can be changed in the future if it is ever needed? Provided we can rehash the dentries if we need to change the normalization, a flag in the superblock, stating what normalization method is used, should suffice if we ever want to support other normalization methods. I have to say, It is not in my plans to support anything other than NFKD. :) > ... and what I'm really asking is do we really want to be specifying > whether or not normalization is a Thing as a property of the encoding, > or a property of the file system (or object, or document) that uses > that particular encoding? I see normalization as an inherent property of the encoding, since, for the user equivalent strings should mean the same thing in the natural language. But I see the point of filesystems wanting to ignore normalization. I am pending towards the permissive route, where this can be enabled/disabled when loading a NLS charset table. 
This way we can merge utf8 and utf8n and satisfy the normalization
case, while keeping compatibility with older users.  What do you think?
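[A minimal sketch of the hash-selection rule Gabriel describes above,
i.e. Hash(normalization(x)) when the feature flag is set and
Hash(casefold(x)) when the parent directory's casefold flag is also
set; the helper names are hypothetical stand-ins for the real dir_index
hash and NLS routines:]

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/* Assumed helpers standing in for the real dx hash and NLS routines. */
	uint32_t dx_hash(const unsigned char *name, size_t len);
	size_t charset_normalize(const unsigned char *in, size_t len,
				 unsigned char *out, size_t outlen);
	size_t charset_casefold(const unsigned char *in, size_t len,
				 unsigned char *out, size_t outlen);

	/*
	 * Pick which form of the name feeds the dir_index hash:
	 *   - encoding feature off:              hash the raw name
	 *   - encoding feature on:               hash the normalized name
	 *   - parent dir casefold flag also on:  hash the casefolded name
	 */
	static uint32_t dirent_name_hash(const unsigned char *name, size_t len,
					 bool encoding_enabled, bool dir_casefold)
	{
		unsigned char buf[255];	/* EXT4_NAME_LEN */
		size_t n;

		if (!encoding_enabled)
			return dx_hash(name, len);

		n = dir_casefold ? charset_casefold(name, len, buf, sizeof(buf))
				 : charset_normalize(name, len, buf, sizeof(buf));

		return dx_hash(buf, n);
	}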