From patchwork Tue Jul 3 17:06:40 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gabriel Krisman Bertazi
X-Patchwork-Id: 938817
Return-Path:
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Authentication-Results: ozlabs.org;
 spf=none (mailfrom) smtp.mailfrom=vger.kernel.org
 (client-ip=209.132.180.67; helo=vger.kernel.org;
 envelope-from=linux-ext4-owner@vger.kernel.org; receiver=)
Authentication-Results: ozlabs.org;
 dmarc=fail (p=none dis=none) header.from=collabora.co.uk
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by ozlabs.org (Postfix) with ESMTP id 41KrCS1kn1z9s3Z
 for ; Wed, 4 Jul 2018 03:07:32 +1000 (AEST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1753547AbeGCRHO (ORCPT ); Tue, 3 Jul 2018 13:07:14 -0400
Received: from bhuna.collabora.co.uk ([46.235.227.227]:33332 "EHLO
 bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753341AbeGCRHN (ORCPT );
 Tue, 3 Jul 2018 13:07:13 -0400
Received: from [127.0.0.1] (localhost [127.0.0.1])
 (Authenticated sender: krisman) with ESMTPSA id 775AE2605C7
From: Gabriel Krisman Bertazi
To: tytso@mit.edu
Cc: linux-ext4@vger.kernel.org, darrick.wong@oracle.com,
 kernel@collabora.com, Gabriel Krisman Bertazi
Subject: [PATCH 00/20] EXT4 encoding support
Date: Tue, 3 Jul 2018 13:06:40 -0400
Message-Id: <20180703170700.9306-1-krisman@collabora.co.uk>
X-Mailer: git-send-email 2.18.0
Sender: linux-ext4-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-ext4@vger.kernel.org

Hi Ted,

This patchset implements encoding support as a superblock feature, valid
for the entire disk, as an intermediate step to my goal of supporting
case-insensitiveness. The superblock records the encoding in a reserved
field, but the encoding can also be forced through the use of a mount
flag called encoding=.

Since not every NLS table supports normalization operations, we limit
which encodings can be used by an ext4 volume. Right now, ascii and
utf8n are supported; utf8n is a new version of the utf8 charset with
normalization support based on the SGI patches, which are part of this
patchset.

This patchset also includes the NLS changes that I am proposing, since
encoding-awareness depends on NLS and the former patchset didn't get
much review beforehand.

As usual, I did not include the source ucd files because they would
bounce on the list, but a completely functional implementation can be
found at:

  https://gitlab.collabora.com/krisman/linux.git -b ext4-ci-directory

I am also not supporting encoding with encrypted directories, given the
cost of searching encrypted directories where the digested name is not
normalized: we would need to decrypt each filename beforehand, so I
decided to simply skip this for now. If the user tries to mount, with
an encoding, a filesystem that has the encryption feature, we simply
bail out saying it is not supported.

The patchset passes the xfstests smoke tests without failures, with the
obvious exception of generic/453. This test, which verifies that
multiple files having similar names (which would match under utf8
normalization) are not the same file, doesn't really make sense with
this patchset, since it verifies that the fs is *not* encoding aware. I
have a patch for xfstests that adds encoding detection and skips this
test, which I can send once we agree on the mount options and the
superblock layout changes for this feature.

I am also CC'ing Darrick for his input on utf8n and NLS.

I am not listing this as a new version of the previous NLS patchsets,
because it includes a lot more patches than the original patchset, but I
left the changelogs from the previous versions for the NLS patches.

Please let me know what you think.

Gabriel Krisman Bertazi (16):
  nls: Wrap uni2char/char2uni callers
  nls: Wrap charset field access
  nls: Wrap charset hooks in ops structure
  nls: Split default charset from NLS core
  nls: Split struct nls_charset from struct nls_table
  nls: Add support for multiple versions of an encoding
  nls: Add new interface for string comparisons
  nls: Let charsets define the behavior of tolower/toupper
  nls: Add optional normalization and casefold hooks
  nls: utf8norm: Integrate utf8norm code with NLS subsystem
  nls: utf8norm: Introduce test module for utf8norm implementation
  nls: ascii: Support casefold and normalization operations
  ext4: Include encoding information in the superblock
  ext4: Support encoding-aware file name lookups
  vfs: Handle case-exact lookup in d_add_ci
  ext4: Implement encoding-aware dcache hooks

Olaf Weber (4):
  nls: utf8norm: Add unicode character database files
  scripts: add trie generator for UTF-8
  nls: utf8norm: Introduce code for UTF-8 normalization
  nls: utf8norm: reduce the size of utf8data[]

 fs/befs/linuxvfs.c                   |    8 +-
 fs/cifs/cifs_unicode.c               |   15 +-
 fs/cifs/cifsfs.c                     |    2 +-
 fs/cifs/connect.c                    |    2 +-
 fs/cifs/dir.c                        |    7 +-
 fs/dcache.c                          |   33 +-
 fs/ext4/dir.c                        |   30 +
 fs/ext4/ext4.h                       |    8 +-
 fs/ext4/namei.c                      |   60 +-
 fs/ext4/super.c                      |  123 +
 fs/fat/dir.c                         |   13 +-
 fs/fat/inode.c                       |    6 +-
 fs/fat/namei_vfat.c                  |    6 +-
 fs/hfs/super.c                       |    6 +-
 fs/hfs/trans.c                       |    9 +-
 fs/hfsplus/options.c                 |    2 +-
 fs/hfsplus/unicode.c                 |    6 +-
 fs/isofs/inode.c                     |    5 +-
 fs/isofs/joliet.c                    |    3 +-
 fs/jfs/jfs_unicode.c                 |    9 +-
 fs/jfs/super.c                       |    3 +-
 fs/nls/Kconfig                       |   13 +
 fs/nls/Makefile                      |   19 +
 fs/nls/mac-celtic.c                  |   34 +-
 fs/nls/mac-centeuro.c                |   34 +-
 fs/nls/mac-croatian.c                |   34 +-
 fs/nls/mac-cyrillic.c                |   34 +-
 fs/nls/mac-gaelic.c                  |   34 +-
 fs/nls/mac-greek.c                   |   34 +-
 fs/nls/mac-iceland.c                 |   34 +-
 fs/nls/mac-inuit.c                   |   34 +-
 fs/nls/mac-roman.c                   |   34 +-
 fs/nls/mac-romanian.c                |   34 +-
 fs/nls/mac-turkish.c                 |   34 +-
 fs/nls/nls_ascii.c                   |   67 +-
 fs/nls/nls_core.c                    |  141 ++
 fs/nls/nls_cp1250.c                  |   34 +-
 fs/nls/nls_cp1251.c                  |   34 +-
 fs/nls/nls_cp1255.c                  |   36 +-
 fs/nls/nls_cp437.c                   |   34 +-
 fs/nls/nls_cp737.c                   |   34 +-
 fs/nls/nls_cp775.c                   |   34 +-
 fs/nls/nls_cp850.c                   |   34 +-
 fs/nls/nls_cp852.c                   |   34 +-
 fs/nls/nls_cp855.c                   |   34 +-
 fs/nls/nls_cp857.c                   |   34 +-
 fs/nls/nls_cp860.c                   |   34 +-
 fs/nls/nls_cp861.c                   |   34 +-
 fs/nls/nls_cp862.c                   |   34 +-
 fs/nls/nls_cp863.c                   |   34 +-
 fs/nls/nls_cp864.c                   |   34 +-
 fs/nls/nls_cp865.c                   |   34 +-
 fs/nls/nls_cp866.c                   |   34 +-
 fs/nls/nls_cp869.c                   |   34 +-
 fs/nls/nls_cp874.c                   |   36 +-
 fs/nls/nls_cp932.c                   |   36 +-
 fs/nls/nls_cp936.c                   |   36 +-
 fs/nls/nls_cp949.c                   |   36 +-
 fs/nls/nls_cp950.c                   |   36 +-
 fs/nls/{nls_base.c => nls_default.c} |  124 +-
 fs/nls/nls_euc-jp.c                  |   29 +-
 fs/nls/nls_iso8859-1.c               |   34 +-
 fs/nls/nls_iso8859-13.c              |   34 +-
 fs/nls/nls_iso8859-14.c              |   34 +-
 fs/nls/nls_iso8859-15.c              |   34 +-
 fs/nls/nls_iso8859-2.c               |   34 +-
 fs/nls/nls_iso8859-3.c               |   34 +-
 fs/nls/nls_iso8859-4.c               |   34 +-
 fs/nls/nls_iso8859-5.c               |   34 +-
 fs/nls/nls_iso8859-6.c               |   34 +-
 fs/nls/nls_iso8859-7.c               |   34 +-
 fs/nls/nls_iso8859-9.c               |   34 +-
 fs/nls/nls_koi8-r.c                  |   34 +-
 fs/nls/nls_koi8-ru.c                 |   30 +-
 fs/nls/nls_koi8-u.c                  |   34 +-
 fs/nls/nls_utf8.c                    |   34 +-
 fs/nls/nls_utf8n-core.c              |  276 ++
 fs/nls/nls_utf8n-norm.c              |  797 ++++++
 fs/nls/nls_utf8n-selftest.c          |  307 +++
 fs/nls/ucd/README                    |   33 +
 fs/nls/utf8n.h                       |  117 +
 fs/ntfs/inode.c                      |    2 +-
 fs/ntfs/super.c                      |    6 +-
 fs/ntfs/unistr.c                     |   13 +-
 fs/udf/super.c                       |    3 +-
 fs/udf/unicode.c                     |    4 +-
 include/linux/nls.h                  |  127 +-
 scripts/Makefile                     |    1 +
 scripts/mkutf8data.c                 | 3464 ++++++++++++++++++++++++++
 89 files changed, 7018 insertions(+), 555 deletions(-)
 create mode 100644 fs/nls/nls_core.c
 rename fs/nls/{nls_base.c => nls_default.c} (89%)
 create mode 100644 fs/nls/nls_utf8n-core.c
 create mode 100644 fs/nls/nls_utf8n-norm.c
 create mode 100644 fs/nls/nls_utf8n-selftest.c
 create mode 100644 fs/nls/ucd/README
 create mode 100644 fs/nls/utf8n.h
 create mode 100644 scripts/mkutf8data.c