diff mbox

Question on fscrypt_d_revalidate() and fstest generic/429

Message ID e2d845a3-a863-1506-a0de-576049c84e81@nod.at
State Not Applicable
Headers show

Commit Message

Richard Weinberger May 15, 2017, 2:39 p.m. UTC
Hi!

on UBIFS, fstest generic/429 fails due to -ENFILE because the internal orphan
list reaches the maximum size.
When you unlink a file, the inode goes into the orphan list, in UBIFS' evict() function
it is being removed later. So, only unlinked but used inodes should stay in that list.

If a directory is encrypted, evict() is not being called although the inode has no
users anymore.
It turned out evict() is not being called because fscrypt's fscrypt_d_revalidate()
function.
When I omit fscrypt_set_d_op() in UBIFS code, the test works just fine.

Can it be that fscrypt_d_revalidate() misses the case of i_nlink being 0?
It seem to treat unlinked inodes as already gone and they stay.

The following change makes the problem go away here:

Does this change make sense? TBH, I'm not really an expert in this area and it is also
not clear to me why you don't see these issue on ext4 or f2fs.
Maybe UBIFS' limitations kick in much earlier. ;-)

Thanks,
//richard

Comments

Eric Biggers May 15, 2017, 7:45 p.m. UTC | #1
Hi Richard,

On Mon, May 15, 2017 at 04:39:23PM +0200, Richard Weinberger wrote:
> Hi!
> 
> on UBIFS, fstest generic/429 fails due to -ENFILE because the internal orphan
> list reaches the maximum size.
> When you unlink a file, the inode goes into the orphan list, in UBIFS' evict() function
> it is being removed later. So, only unlinked but used inodes should stay in that list.
> 
> If a directory is encrypted, evict() is not being called although the inode has no
> users anymore.
> It turned out evict() is not being called because fscrypt's fscrypt_d_revalidate()
> function.
> When I omit fscrypt_set_d_op() in UBIFS code, the test works just fine.

Well, I assume you mean the t_encrypted_d_revalidate portion of the test.
generic/429 will still fail overall if you remove fscrypt_set_d_op() --- which
is expected, since it's testing dentry revalidation after all.

> 
> Can it be that fscrypt_d_revalidate() misses the case of i_nlink being 0?
> It seem to treat unlinked inodes as already gone and they stay.
> 
> The following change makes the problem go away here:
> 
> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
> index 6d6eca394d4d..d0c19838e513 100644
> --- a/fs/crypto/crypto.c
> +++ b/fs/crypto/crypto.c
> @@ -327,6 +327,7 @@ EXPORT_SYMBOL(fscrypt_decrypt_page);
>  static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
>  {
>         struct dentry *dir;
> +       struct inode *inode = d_inode(dentry);
>         int dir_has_key, cached_with_key;
> 
>         if (flags & LOOKUP_RCU)
> @@ -359,6 +360,10 @@ static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
>                         (!cached_with_key && dir_has_key) ||
>                         (cached_with_key && !dir_has_key))
>                 return 0;
> +
> +       if (!inode || inode->i_nlink == 0)
> +               return 0;
> +
>         return 1;
>  }
> 
> Does this change make sense? TBH, I'm not really an expert in this area and it is also
> not clear to me why you don't see these issue on ext4 or f2fs.
> Maybe UBIFS' limitations kick in much earlier. ;-)

The test is repeatedly creating and removing a directory "dir" while lookups are
being done in it.  It seems the problem is that many dentries are being created
for "dir", and they pin many different inodes, all at the same time.  This
actually happens for ext4 too; it just doesn't cause an observable error.

I doubt it's the right solution to make fscrypt_d_revalidate() look at
->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
inode.  I think there is probably a VFS bug that is causing the dentries to not
be freed.

Eric
Richard Weinberger May 15, 2017, 7:51 p.m. UTC | #2
Eric,

Am 15.05.2017 um 21:45 schrieb Eric Biggers:
>> If a directory is encrypted, evict() is not being called although the inode has no
>> users anymore.
>> It turned out evict() is not being called because fscrypt's fscrypt_d_revalidate()
>> function.
>> When I omit fscrypt_set_d_op() in UBIFS code, the test works just fine.
> 
> Well, I assume you mean the t_encrypted_d_revalidate portion of the test.
> generic/429 will still fail overall if you remove fscrypt_set_d_op() --- which
> is expected, since it's testing dentry revalidation after all.

Ah, yes.

>>
>> Can it be that fscrypt_d_revalidate() misses the case of i_nlink being 0?
>> It seem to treat unlinked inodes as already gone and they stay.
>>
>> The following change makes the problem go away here:
>>
>> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
>> index 6d6eca394d4d..d0c19838e513 100644
>> --- a/fs/crypto/crypto.c
>> +++ b/fs/crypto/crypto.c
>> @@ -327,6 +327,7 @@ EXPORT_SYMBOL(fscrypt_decrypt_page);
>>  static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
>>  {
>>         struct dentry *dir;
>> +       struct inode *inode = d_inode(dentry);
>>         int dir_has_key, cached_with_key;
>>
>>         if (flags & LOOKUP_RCU)
>> @@ -359,6 +360,10 @@ static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
>>                         (!cached_with_key && dir_has_key) ||
>>                         (cached_with_key && !dir_has_key))
>>                 return 0;
>> +
>> +       if (!inode || inode->i_nlink == 0)
>> +               return 0;
>> +
>>         return 1;
>>  }
>>
>> Does this change make sense? TBH, I'm not really an expert in this area and it is also
>> not clear to me why you don't see these issue on ext4 or f2fs.
>> Maybe UBIFS' limitations kick in much earlier. ;-)
> 
> The test is repeatedly creating and removing a directory "dir" while lookups are
> being done in it.  It seems the problem is that many dentries are being created
> for "dir", and they pin many different inodes, all at the same time.  This
> actually happens for ext4 too; it just doesn't cause an observable error.
> 
> I doubt it's the right solution to make fscrypt_d_revalidate() look at
> ->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
> inode.  I think there is probably a VFS bug that is causing the dentries to not
> be freed.

Not sure. Al? :-)

Thanks,
//richard
Eric Biggers May 15, 2017, 11:25 p.m. UTC | #3
On Mon, May 15, 2017 at 09:51:03PM +0200, Richard Weinberger wrote:
> > 
> > The test is repeatedly creating and removing a directory "dir" while lookups are
> > being done in it.  It seems the problem is that many dentries are being created
> > for "dir", and they pin many different inodes, all at the same time.  This
> > actually happens for ext4 too; it just doesn't cause an observable error.
> > 
> > I doubt it's the right solution to make fscrypt_d_revalidate() look at
> > ->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
> > inode.  I think there is probably a VFS bug that is causing the dentries to not
> > be freed.
> 
> Not sure. Al? :-)
> 

I can reproduce this on an unencrypted directory after updating path_init() in
fs/namei.c to always clear LOOKUP_RCU, so that all path lookups are done in
ref-walk mode.  So I think fscrypt_d_revalidate() was only relevant because it
causes all path lookups to drop out of rcu-walk mode.

It seems that what's happening is the "dir" dentries are not being freed because
each one has a child dentry "file" that is a negative dentry.  The "file" dentry
would normally be freed by shrink_dcache_parent() called from vfs_rmdir(), but
due to a race with stat("dir/file") the "file" dentry sometimes has nonzero
reference count at that time, causing it to remain in the subdirs list.  So we
end up with a negative dentry "file" with 0 refcount and on the dentry LRU list,
and its parent the positive dentry "dir" with 1 refcount.  And the test program
generates thousands of copies of that, with each "dir" referring to a different
inode, and they get freed only when the shrinker runs.

I'm not sure how to fix it...

Eric
Al Viro May 15, 2017, 11:50 p.m. UTC | #4
On Mon, May 15, 2017 at 09:51:03PM +0200, Richard Weinberger wrote:

> > I doubt it's the right solution to make fscrypt_d_revalidate() look at
> > ->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
> > inode.  I think there is probably a VFS bug that is causing the dentries to not
> > be freed.
> 
> Not sure. Al? :-)

What's to tell VFS that they are garbage?  If your code doesn't do it,
VFS has no way to know.  And I certainly agree that using ->d_revalidate()
to trigger their unhashing is a bloody odd design; take it up with whoever
had come up with it...
Richard Weinberger May 16, 2017, 6:47 a.m. UTC | #5
Eric,

Am 16.05.2017 um 01:25 schrieb Eric Biggers:
> On Mon, May 15, 2017 at 09:51:03PM +0200, Richard Weinberger wrote:
>>>
>>> The test is repeatedly creating and removing a directory "dir" while lookups are
>>> being done in it.  It seems the problem is that many dentries are being created
>>> for "dir", and they pin many different inodes, all at the same time.  This
>>> actually happens for ext4 too; it just doesn't cause an observable error.
>>>
>>> I doubt it's the right solution to make fscrypt_d_revalidate() look at
>>> ->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
>>> inode.  I think there is probably a VFS bug that is causing the dentries to not
>>> be freed.
>>
>> Not sure. Al? :-)
>>
> 
> I can reproduce this on an unencrypted directory after updating path_init() in
> fs/namei.c to always clear LOOKUP_RCU, so that all path lookups are done in
> ref-walk mode.  So I think fscrypt_d_revalidate() was only relevant because it
> causes all path lookups to drop out of rcu-walk mode.

On ext4 or UBIFS?

Thanks,
//richard
Richard Weinberger May 16, 2017, 11:22 a.m. UTC | #6
Al,

Am 16.05.2017 um 01:50 schrieb Al Viro:
> On Mon, May 15, 2017 at 09:51:03PM +0200, Richard Weinberger wrote:
> 
>>> I doubt it's the right solution to make fscrypt_d_revalidate() look at
>>> ->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
>>> inode.  I think there is probably a VFS bug that is causing the dentries to not
>>> be freed.
>>
>> Not sure. Al? :-)
> 
> What's to tell VFS that they are garbage?  If your code doesn't do it,
> VFS has no way to know.  And I certainly agree that using ->d_revalidate()
> to trigger their unhashing is a bloody odd design; take it up with whoever
> had come up with it...

Not sure whether I understood your answer.
What is the preferred way to inform VFS that the dentry is no longer valid?

Thanks,
//richard
Eric Biggers May 16, 2017, 10:07 p.m. UTC | #7
On Tue, May 16, 2017 at 08:47:43AM +0200, Richard Weinberger wrote:
> Eric,
> 
> Am 16.05.2017 um 01:25 schrieb Eric Biggers:
> > On Mon, May 15, 2017 at 09:51:03PM +0200, Richard Weinberger wrote:
> >>>
> >>> The test is repeatedly creating and removing a directory "dir" while lookups are
> >>> being done in it.  It seems the problem is that many dentries are being created
> >>> for "dir", and they pin many different inodes, all at the same time.  This
> >>> actually happens for ext4 too; it just doesn't cause an observable error.
> >>>
> >>> I doubt it's the right solution to make fscrypt_d_revalidate() look at
> >>> ->i_nlink, since ->d_revalidate() is meant to validate the filename, not the
> >>> inode.  I think there is probably a VFS bug that is causing the dentries to not
> >>> be freed.
> >>
> >> Not sure. Al? :-)
> >>
> > 
> > I can reproduce this on an unencrypted directory after updating path_init() in
> > fs/namei.c to always clear LOOKUP_RCU, so that all path lookups are done in
> > ref-walk mode.  So I think fscrypt_d_revalidate() was only relevant because it
> > causes all path lookups to drop out of rcu-walk mode.
> 
> On ext4 or UBIFS?
> 
> Thanks,
> //richard

Both; the inode "leak" isn't filesystem-specific, beyond the fact that UBIFS
apparently limits the number of inodes on its orphan list while ext4 does not.
(How do I know it happens on ext4 then?  /proc/slabinfo shows that lots of ext4
inodes have been allocated, and after an unclean shutdown and remounting, it's
reported that 50,000+ orphan inodes were deleted.)

There are lots of cases in which path lookups switch from rcu-walk mode into
ref-walk mode, so the fact that it was being caused by ->d_revalidate() in this
specific situation isn't really important, IMO.

I think the actual fix would be something along the lines of making vfs_rmdir()
unhash any negative child dentries, so that they get "killed" by dput() later.

Eric
Richard Weinberger May 16, 2017, 10:27 p.m. UTC | #8
Eric,

Am 17.05.2017 um 00:07 schrieb Eric Biggers:
>>> I can reproduce this on an unencrypted directory after updating path_init() in
>>> fs/namei.c to always clear LOOKUP_RCU, so that all path lookups are done in
>>> ref-walk mode.  So I think fscrypt_d_revalidate() was only relevant because it
>>> causes all path lookups to drop out of rcu-walk mode.
>>
>> On ext4 or UBIFS?
>
> Both; the inode "leak" isn't filesystem-specific, beyond the fact that UBIFS
> apparently limits the number of inodes on its orphan list while ext4 does not.
> (How do I know it happens on ext4 then?  /proc/slabinfo shows that lots of ext4
> inodes have been allocated, and after an unclean shutdown and remounting, it's
> reported that 50,000+ orphan inodes were deleted.)

UBIFS's orphan list is limited by the size of a LEB (logical erase block).
With nandsim's default settings such a LEB is rather small and can store only
2k orphans.

> There are lots of cases in which path lookups switch from rcu-walk mode into
> ref-walk mode, so the fact that it was being caused by ->d_revalidate() in this
> specific situation isn't really important, IMO.
> 
> I think the actual fix would be something along the lines of making vfs_rmdir()
> unhash any negative child dentries, so that they get "killed" by dput() later.

Thanks,
//richard
diff mbox

Patch

diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
index 6d6eca394d4d..d0c19838e513 100644
--- a/fs/crypto/crypto.c
+++ b/fs/crypto/crypto.c
@@ -327,6 +327,7 @@  EXPORT_SYMBOL(fscrypt_decrypt_page);
 static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
 {
        struct dentry *dir;
+       struct inode *inode = d_inode(dentry);
        int dir_has_key, cached_with_key;

        if (flags & LOOKUP_RCU)
@@ -359,6 +360,10 @@  static int fscrypt_d_revalidate(struct dentry *dentry, unsigned int flags)
                        (!cached_with_key && dir_has_key) ||
                        (cached_with_key && !dir_has_key))
                return 0;
+
+       if (!inode || inode->i_nlink == 0)
+               return 0;
+
        return 1;
 }