[GIT,PULL] inode->i_version rework for v4.16

Message ID	1517228795.5965.24.camel@redhat.com
State	Not Applicable, archived
Headers	show Return-Path: <linux-ext4-owner@vger.kernel.org> Message-ID: <1517228795.5965.24.camel@redhat.com> Subject: [GIT PULL] inode->i_version rework for v4.16 From: Jeff Layton <jlayton@redhat.com> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: open list <linux-kernel@vger.kernel.org>, "<linux-fsdevel@vger.kernel.org>" <linux-fsdevel@vger.kernel.org>, Al Viro <viro@zeniv.linux.org.uk>, xfs <linux-xfs@vger.kernel.org>, "open list:NFS, SUNRPC, AND..." <linux-nfs@vger.kernel.org>, linux-btrfs <linux-btrfs@vger.kernel.org>, linux-integrity <linux-integrity@vger.kernel.org>, Andrew Morton <akpm@linux-foundation.org>, "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org> Date: Mon, 29 Jan 2018 07:26:35 -0500 Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk
Series	[GIT,PULL] inode->i_version rework for v4.16 \| expand [GIT,PULL] inode->i_version rework for v4.16

Jeff Layton Jan. 29, 2018, 12:26 p.m. UTC

The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:

  Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git tags/iversion-v4.16-1

for you to fetch changes up to f02a9ad1f15daf4378afeda025a53455f72645dd:

  fs: handle inode->i_version more efficiently (2018-01-29 06:42:21 -0500)

----------------------------------------------------------------
Hi Linus,

This pile of patches is a rework of the inode->i_version field. We have
traditionally incremented that field on every inode data or metadata
change. Typically this increment needs to be logged on disk even when
nothing else has changed, which is rather expensive.

It turns out though that none of the consumers of that field actually
require this behavior. The only real requirement for all of them is that
it be different iff the inode has changed since the last time the field
was checked.

Given that, we can optimize away most of the i_version increments and
avoid dirtying inode metadata when the only change is to the i_version 
and no one is querying it. Queries of the i_version field are rather
rare, so we can help write performance under many common workloads.

This patch series converts existing accesses of the i_version field to a
new API, and then converts all of the in-kernel filesystems to use it.
The last patch in the series then converts the backend implementation to
a scheme that optimizes away a large portion of the metadata updates
when no one is looking at it.

In my own testing this series significantly helps performance with small
I/O sizes. I also got this email for Christmas this year from the kernel
test robot (a 244% r/w bandwidth improvement with XFS over DAX, with 4k
writes):

    https://lkml.org/lkml/2017/12/25/8

A few of the earlier patches in this pile are also flowing to you via
other trees (mm, integrity, and nfsd trees in particular), so there may
be some minor merge conflicts here. Hopefully they're trivial to
resolve, but let me know if there are problems.

Thanks!
----------------------------------------------------------------
Jeff Layton (21):
      lustre: don't set f_version in ll_readdir
      ntfs: remove i_version handling
      fs: new API for handling inode->i_version
      fs: don't take the i_lock in inode_inc_iversion
      fat: convert to new i_version API
      affs: convert to new i_version API
      afs: convert to new i_version API
      btrfs: convert to new i_version API
      exofs: switch to new i_version API
      ext2: convert to new i_version API
      ext4: convert to new i_version API
      nfs: convert to new i_version API
      nfsd: convert to new i_version API
      ocfs2: convert to new i_version API
      ufs: use new i_version API
      xfs: convert to new i_version API
      IMA: switch IMA over to new i_version API
      fs: only set S_VERSION when updating times if necessary
      xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
      btrfs: only dirty the inode in btrfs_update_time if something was changed
      fs: handle inode->i_version more efficiently

Sascha Hauer (1):
      ima: Use i_version only when filesystem supports it

 drivers/staging/lustre/lustre/llite/dir.c |   3 -
 fs/affs/amigaffs.c                        |   5 +-
 fs/affs/dir.c                             |   5 +-
 fs/affs/super.c                           |   3 +-
 fs/afs/fsclient.c                         |   3 +-
 fs/afs/inode.c                            |   5 +-
 fs/btrfs/delayed-inode.c                  |   7 +-
 fs/btrfs/file.c                           |   1 +
 fs/btrfs/inode.c                          |  12 +-
 fs/btrfs/ioctl.c                          |   1 +
 fs/btrfs/tree-log.c                       |   4 +-
 fs/btrfs/xattr.c                          |   1 +
 fs/exofs/dir.c                            |   9 +-
 fs/exofs/super.c                          |   3 +-
 fs/ext2/dir.c                             |   9 +-
 fs/ext2/super.c                           |   5 +-
 fs/ext4/dir.c                             |   9 +-
 fs/ext4/inline.c                          |   7 +-
 fs/ext4/inode.c                           |  13 +-
 fs/ext4/ioctl.c                           |   3 +-
 fs/ext4/namei.c                           |   5 +-
 fs/ext4/super.c                           |   3 +-
 fs/ext4/xattr.c                           |   5 +-
 fs/fat/dir.c                              |   3 +-
 fs/fat/inode.c                            |   9 +-
 fs/fat/namei_msdos.c                      |   7 +-
 fs/fat/namei_vfat.c                       |  22 +-
 fs/inode.c                                |  11 +-
 fs/nfs/delegation.c                       |   3 +-
 fs/nfs/fscache-index.c                    |   5 +-
 fs/nfs/inode.c                            |  18 +-
 fs/nfs/nfs4proc.c                         |  10 +-
 fs/nfs/nfstrace.h                         |   5 +-
 fs/nfs/write.c                            |   8 +-
 fs/nfsd/nfsfh.h                           |   3 +-
 fs/ntfs/inode.c                           |   9 -
 fs/ntfs/mft.c                             |   6 -
 fs/ocfs2/dir.c                            |  15 +-
 fs/ocfs2/inode.c                          |   3 +-
 fs/ocfs2/namei.c                          |   3 +-
 fs/ocfs2/quota_global.c                   |   3 +-
 fs/ufs/dir.c                              |   9 +-
 fs/ufs/inode.c                            |   3 +-
 fs/ufs/super.c                            |   3 +-
 fs/xfs/libxfs/xfs_inode_buf.c             |   7 +-
 fs/xfs/xfs_icache.c                       |   5 +-
 fs/xfs/xfs_inode.c                        |   3 +-
 fs/xfs/xfs_inode_item.c                   |   3 +-
 fs/xfs/xfs_trans_inode.c                  |  16 +-
 include/linux/fs.h                        |  17 +-
 include/linux/iversion.h                  | 341 ++++++++++++++++++++++++++++++
 security/integrity/ima/ima_api.c          |   3 +-
 security/integrity/ima/ima_main.c         |   4 +-
 53 files changed, 525 insertions(+), 153 deletions(-)
 create mode 100644 include/linux/iversion.h

Linus Torvalds Jan. 29, 2018, 9:50 p.m. UTC | #1

On Mon, Jan 29, 2018 at 4:26 AM, Jeff Layton <jlayton@redhat.com> wrote:
>
> This pile of patches is a rework of the inode->i_version field. We have
> traditionally incremented that field on every inode data or metadata
> change. Typically this increment needs to be logged on disk even when
> nothing else has changed, which is rather expensive.

Hmm. I have pulled this, but it is really really broken in one place,
to the degree that I always went "no, I won't pull this garbage".

But the breakage is potential, not actual, and can be fixed trivially,
so I'll let it slide - but I do require it to be fixed. And I require
people to *think* about it.

So what's to horribly horribly wrong?

The inode_cmp_iversion{+raw}() functions are pure and utter crap.

Why?

You say that they return 0/negative/positive, but they do so in a
completely broken manner. They return that ternary value as the
sequence number difference in a 's64', which means that if you
actually care about that ternary value, and do the *sane* thing that
the kernel-doc of the function implies is the right thing, you would
do

    int cmp = inode_cmp_iversion(inode, old);

    if (cmp < 0 ...

and as a result you get code that looks sane, but that doesn't
actually *WORK* right.

To make it even worse, it will actually work in practice by accident
in 99.99999% of all cases, so now you have

 (a) subtly buggy code
 (b) that looks fine
 (c) and that works in testing

which is just about the worst possible case for any code. The
interface is simply garbage that encourages bugs.

And the bug wouldn't be in the user, the bug would be in this code you
just sent me. The interface is simply wrong.

So this absolutely needs to be fixed. I see two fixes:

 - just return a boolean. That's all that any current user actually
wants, so the ternary value seems pointless.

 - make it return an 'int', and not just any int, but -1/0/1. That way
there is no worry about uses, and if somebody *really* cares about the
ternary value, they can now use a "switch" statement to get it
(alternatively, make it return an enum, but whatever).

That "ternary" function that has 18446744069414584320 incorrect return
values really is unacceptable.

                Linus

Linus Torvalds Jan. 29, 2018, 9:51 p.m. UTC | #2

On Mon, Jan 29, 2018 at 1:50 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. I have pulled this, but it is really really broken in one place,
> to the degree that I always went "no, I won't pull this garbage".

always=almost.

I'd blame auto-correct, but I'm not on the phone.

                Linus

Jeff Layton Jan. 30, 2018, 12:05 p.m. UTC | #3

On Mon, 2018-01-29 at 13:50 -0800, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 4:26 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > This pile of patches is a rework of the inode->i_version field. We have
> > traditionally incremented that field on every inode data or metadata
> > change. Typically this increment needs to be logged on disk even when
> > nothing else has changed, which is rather expensive.
> 
> Hmm. I have pulled this, but it is really really broken in one place,
> to the degree that I always went "no, I won't pull this garbage".
> 
> But the breakage is potential, not actual, and can be fixed trivially,
> so I'll let it slide - but I do require it to be fixed. And I require
> people to *think* about it.
> 
> So what's to horribly horribly wrong?
> 
> The inode_cmp_iversion{+raw}() functions are pure and utter crap.
> 
> Why?
> 
> You say that they return 0/negative/positive, but they do so in a
> completely broken manner. They return that ternary value as the
> sequence number difference in a 's64', which means that if you
> actually care about that ternary value, and do the *sane* thing that
> the kernel-doc of the function implies is the right thing, you would
> do
> 
>     int cmp = inode_cmp_iversion(inode, old);
> 
>     if (cmp < 0 ...
> 
> and as a result you get code that looks sane, but that doesn't
> actually *WORK* right.
> 

My intent here was to have this handle wraparound using the same sort of
method that the time_before/time_after macros do. Obviously, I didn't
document that well.

I want to make sure I understand what's actually broken here thoug. Is
it only broken when the two values are more than 2**63 apart, or is
there something else more fundamentally wrong here?

> To make it even worse, it will actually work in practice by accident
> in 99.99999% of all cases, so now you have
> 
>  (a) subtly buggy code
>  (b) that looks fine
>  (c) and that works in testing
> 
> which is just about the worst possible case for any code. The
> interface is simply garbage that encourages bugs.
> 
> And the bug wouldn't be in the user, the bug would be in this code you
> just sent me. The interface is simply wrong.
> 
> So this absolutely needs to be fixed. I see two fixes:
> 
>  - just return a boolean. That's all that any current user actually
> wants, so the ternary value seems pointless.
> 
>  - make it return an 'int', and not just any int, but -1/0/1. That way
> there is no worry about uses, and if somebody *really* cares about the
> ternary value, they can now use a "switch" statement to get it
> (alternatively, make it return an enum, but whatever).
> 
> That "ternary" function that has 18446744069414584320 incorrect return
> values really is unacceptable.
> 

I think I'll just make it return a boolean value like you suggested
first. I'll send a patch to fix it once I've done some basic testing
with it.

Many thanks,
--
Jeff Layton <jlayton@redhat.com>

Theodore Ts'o Jan. 30, 2018, 3:15 p.m. UTC | #4

On Tue, Jan 30, 2018 at 07:05:48AM -0500, Jeff Layton wrote:
> 
> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

The other problem is that returning a s64 (or a u64) is extremely
inefficient on a 32-bit platform.  So in general, it's better to avoid
returning a 64-bit type unless it's really necessary....

						- Ted

Linus Torvalds Jan. 30, 2018, 4:36 p.m. UTC | #5

On Tue, Jan 30, 2018 at 4:05 AM, Jeff Layton <jlayton@redhat.com> wrote:
>
> My intent here was to have this handle wraparound using the same sort of
> method that the time_before/time_after macros do. Obviously, I didn't
> document that well.

Oh, the intent was clear. The implementation was wrong.

Note that "time_before()" returns a _boolean_.

So yes, the time comparison functions do a 64-bit subtraction, exactly
like yours do. BUT THEY DO NOT RETURN THAT DIFFERENCE. They test the
sign in 64 bits and return that boolean single-bit value.

> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

There's something fundamentally wrong. The _intent_ is to allow
numbers up to 2**63 apart, but if somebody does that

     int cmp = inode_cmp_iversion(inode, old);

     if (cmp < 0 ...

then it actually only ever tests numbers 2**31 apart, because the high
32 bits will have been thrown away when the 64-bit difference is cast
to "int".

And what used to be a sign bit (in 64 bits) no longer exists, and the
above tests the *new* sign bit that is bit #31, not #63.

So you are a factor of four billion off.

And while 2**63 is a big number and perfectly ok for a filesystem
event difference, a difference of 2**31 is a "this can actually
happen".

Yes, even 2**31 is still a big difference, and it will take a very
unusual situation, and quite a long time to trigger: something that
does a thousand filesystem ops per second will not see the problem for
24 days. So you'll never see it in a test. But 24 days happens in real
life..

That's why you need to do the comparison against zero inside the
actual helper functions like the time comparisons do. Because if you
return the 64-bit difference, it will be trivially lost, and the code
will _look_ right, but not work right.

The fact that you didn't even seem to see the problem in my example
calling sequence just proves that point.

                   Linus

[GIT,PULL] inode->i_version rework for v4.16

Pull-request

Message

Comments