diff mbox series

ext4: add rename whiteout support for fast commit

Message ID 20210316221921.1124955-1-harshadshirwadkar@gmail.com
State Awaiting Upstream
Headers show
Series ext4: add rename whiteout support for fast commit | expand

Commit Message

harshad shirwadkar March 16, 2021, 10:19 p.m. UTC
This patch adds rename whiteout support in fast commits. Note that the
whiteout object that gets created is actually char device. Which
imples, the function ext4_inode_journal_mode(struct inode *inode)
would return "JOURNAL_DATA" for this inode. This has a consequence in
fast commit code that it will make creation of the whiteout object a
fast-commit ineligible behavior and thus will fall back to full
commits. With this patch, this can be observed by running fast commits
with rename whiteout and seeing the stats generated by ext4_fc_stats
tracepoint as follows:

ext4_fc_stats: dev 254:32 fc ineligible reasons:
XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
num_commits:6, ineligible: 6, numblks: 3

So in short, this patch guarantees that in case of rename whiteout, we
fall back to full commits.

Amir mentioned that instead of creating a new whiteout object for
every rename, we can create a static whiteout object with irrelevant
nlink. That will make fast commits to not fall back to full
commit. But until this happens, this patch will ensure correctness by
falling back to full commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        | 2 ++
 fs/ext4/fast_commit.c | 9 +++++++--
 fs/ext4/namei.c       | 3 +++
 3 files changed, 12 insertions(+), 2 deletions(-)

Comments

Amir Goldstein March 17, 2021, 9:33 a.m. UTC | #1
On Wed, Mar 17, 2021 at 12:19 AM Harshad Shirwadkar
<harshadshirwadkar@gmail.com> wrote:
>
> This patch adds rename whiteout support in fast commits. Note that the

My only problem with this change is the subject.
It sounds like rename whiteout was not possible and now support was added
and it is now possible. This is not the case.
The truth is that rename whiteout is supported but broken with fast commits.
So the subject should reflect that this is a FIX commit, i.e.:

"ext4: fix rename whiteout with fast commit"

And patch should have a Fixes: tag with the commit that added fast commit
support to rename.

Otherwise, patch has stray newline but the rest looks pretty straightforward
to me.

Thanks,
Amir.
harshad shirwadkar March 19, 2021, 1:32 a.m. UTC | #2
Thanks for the review Amir.

Sure changing the subject makes sense.

Also, on further discussions on Ext4 conference call, we also thought
that with this patch, overlayfs customers would not benefit from fast
commits much if they call renames often. So, in order to really make
rename whiteout a fast commit compatible operation, we probably would
need to add support in fast commit to replay a char device creation
event (since whiteout object is a char device). That would imply, we
would need to do careful versioning and would need to burn an on-disk
feature flag.

An alternative to this would be to have a static whiteout object with
irrelevant nlink count and to have every rename point to that object
instead. Based on how we decide to implement that, at max only the
first rename operation would be fast commit incompatible since that's
when this object would get created. All the further operations would
be fast commit compatible. The big benefit of this approach is that
this way we don't have to add support for char device creation in fast
commit replay code and thus we don't have to worry about versioning.

So, I'm planning to work on that in the near future, but meanwhile we
can carry this patch so that at least we don't break rename whiteout
after fast commit replays.

Thanks,
Harshad

On Wed, Mar 17, 2021 at 2:33 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Mar 17, 2021 at 12:19 AM Harshad Shirwadkar
> <harshadshirwadkar@gmail.com> wrote:
> >
> > This patch adds rename whiteout support in fast commits. Note that the
>
> My only problem with this change is the subject.
> It sounds like rename whiteout was not possible and now support was added
> and it is now possible. This is not the case.
> The truth is that rename whiteout is supported but broken with fast commits.
> So the subject should reflect that this is a FIX commit, i.e.:
>
> "ext4: fix rename whiteout with fast commit"
>
> And patch should have a Fixes: tag with the commit that added fast commit
> support to rename.
>
> Otherwise, patch has stray newline but the rest looks pretty straightforward
> to me.
>
> Thanks,
> Amir.
Amir Goldstein March 19, 2021, 5:51 a.m. UTC | #3
[adding overlayfs list]

On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
<harshadshirwadkar@gmail.com> wrote:
>
> Thanks for the review Amir.
>
> Sure changing the subject makes sense.
>
> Also, on further discussions on Ext4 conference call, we also thought
> that with this patch, overlayfs customers would not benefit from fast
> commits much if they call renames often. So, in order to really make
> rename whiteout a fast commit compatible operation, we probably would
> need to add support in fast commit to replay a char device creation
> event (since whiteout object is a char device). That would imply, we
> would need to do careful versioning and would need to burn an on-disk
> feature flag.
>
> An alternative to this would be to have a static whiteout object with
> irrelevant nlink count and to have every rename point to that object
> instead. Based on how we decide to implement that, at max only the
> first rename operation would be fast commit incompatible since that's
> when this object would get created. All the further operations would
> be fast commit compatible. The big benefit of this approach is that
> this way we don't have to add support for char device creation in fast
> commit replay code and thus we don't have to worry about versioning.
>

I'm glad to hear that, Harshad.

Please note that creating a static whiteout object on-disk is one possible
implementation option. Not creating any object on-disk may be even better.

I had suggested the static object approach because it should be pretty
simple to implement and add e2fsprogs support for.

However, if we look at the requirements for RENAME_WHITEOUT,
the resulting directory entry does not actually need to point to any
object on-disk at all.

An alternative implementation would be to create a directory entry
with {d_ino = 0, d_type = DT_WHT}. Lookup of this entry will return
a reference to a singleton read-only ext4 whiteout inode object, which
does not reside on disk, so fast commit is irrelevant in that sense.
i_nlink should be handled carefully, but that should be easier from
doing that for a static on-disk object.

I am not sure how userland tools handle DT_WHT, but I see that
other filesystems can emit this value in theory and man rename(2)
claims that BSD uses DT_WHT, so the common tools should be
able to handle it.

As far as overlayfs is concerned:
1. ovl_lookup() will find an IS_WHITEOUT() inode as usual
2. ovl_dir_read_merged() will need this small patch (below) and will
    not access the inode object at all
3. At first, ovl_whiteout() -> vfs_whiteout() can still create a usual chardev
4. Later, we can initiate the overlayfs instance singleton whiteout
    reference in ovl_check_rename_whiteout() and ovl_whiteout() will
    never get -EMLINK when linking this whiteout object

One other challenge is how to handle users trying to make operations
on the upper layer directly (migrating images etc).
As long as the tools still observe the whiteout as a chadev (with stat(2))
then export and import should work fine (creating a real chardev on import).

If there are tools that try to change  inode permissions recursively on the
upper layer (?) there may be a problem with those read-only whiteouts
although the permission of a whiteout is a moot concept.

Thanks,
Amir.

--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -161,7 +161,7 @@ static struct ovl_cache_entry
*ovl_cache_entry_new(struct ovl_readdir_data *rdd,
        if (ovl_calc_d_ino(rdd, p))
                p->ino = 0;
        p->is_upper = rdd->is_upper;
-       p->is_whiteout = false;
+       p->is_whiteout = (d_type == DT_WHT);

        if (d_type == DT_CHR) {
                p->next_maybe_whiteout = rdd->first_maybe_whiteout;
Miklos Szeredi March 19, 2021, 8:36 a.m. UTC | #4
On Fri, Mar 19, 2021 at 6:52 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [adding overlayfs list]
>
> On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
> <harshadshirwadkar@gmail.com> wrote:
> >
> > Thanks for the review Amir.
> >
> > Sure changing the subject makes sense.
> >
> > Also, on further discussions on Ext4 conference call, we also thought
> > that with this patch, overlayfs customers would not benefit from fast
> > commits much if they call renames often. So, in order to really make
> > rename whiteout a fast commit compatible operation, we probably would
> > need to add support in fast commit to replay a char device creation
> > event (since whiteout object is a char device). That would imply, we
> > would need to do careful versioning and would need to burn an on-disk
> > feature flag.
> >
> > An alternative to this would be to have a static whiteout object with
> > irrelevant nlink count and to have every rename point to that object
> > instead. Based on how we decide to implement that, at max only the
> > first rename operation would be fast commit incompatible since that's
> > when this object would get created. All the further operations would
> > be fast commit compatible. The big benefit of this approach is that
> > this way we don't have to add support for char device creation in fast
> > commit replay code and thus we don't have to worry about versioning.
> >
>
> I'm glad to hear that, Harshad.
>
> Please note that creating a static whiteout object on-disk is one possible
> implementation option. Not creating any object on-disk may be even better.

I don't really get it.  What's the advantage of not having an object?

Readdir returning DT_WHT internally might be nice, but I'd be careful
with exporting that to userspace, since it's likely to cause more
problems that it solves.   And on the stat(2) interface adding S_IFWHT
or even worse: ENOENT are really out of the question due to backward
incompatibility with almost every application.

> One other challenge is how to handle users trying to make operations
> on the upper layer directly (migrating images etc).
> As long as the tools still observe the whiteout as a chadev (with stat(2))
> then export and import should work fine (creating a real chardev on import).

Right.

Can't mkfs.ext4 just create the static object?  That sounds to me like
the simplest approach.

Thanks,
Miklos


Thanks,
Miklos
>
> I had suggested the static object approach because it should be pretty
> simple to implement and add e2fsprogs support for.
>
> However, if we look at the requirements for RENAME_WHITEOUT,
> the resulting directory entry does not actually need to point to any
> object on-disk at all.
>
> An alternative implementation would be to create a directory entry
> with {d_ino = 0, d_type = DT_WHT}. Lookup of this entry will return
> a reference to a singleton read-only ext4 whiteout inode object, which
> does not reside on disk, so fast commit is irrelevant in that sense.
> i_nlink should be handled carefully, but that should be easier from
> doing that for a static on-disk object.
>
> I am not sure how userland tools handle DT_WHT, but I see that
> other filesystems can emit this value in theory and man rename(2)
> claims that BSD uses DT_WHT, so the common tools should be
> able to handle it.
>
> As far as overlayfs is concerned:
> 1. ovl_lookup() will find an IS_WHITEOUT() inode as usual
> 2. ovl_dir_read_merged() will need this small patch (below) and will
>     not access the inode object at all
> 3. At first, ovl_whiteout() -> vfs_whiteout() can still create a usual chardev
> 4. Later, we can initiate the overlayfs instance singleton whiteout
>     reference in ovl_check_rename_whiteout() and ovl_whiteout() will
>     never get -EMLINK when linking this whiteout object
>
> If there are tools that try to change  inode permissions recursively on the
> upper layer (?) there may be a problem with those read-only whiteouts
> although the permission of a whiteout is a moot concept.
>
> Thanks,
> Amir.
>
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -161,7 +161,7 @@ static struct ovl_cache_entry
> *ovl_cache_entry_new(struct ovl_readdir_data *rdd,
>         if (ovl_calc_d_ino(rdd, p))
>                 p->ino = 0;
>         p->is_upper = rdd->is_upper;
> -       p->is_whiteout = false;
> +       p->is_whiteout = (d_type == DT_WHT);
>
>         if (d_type == DT_CHR) {
>                 p->next_maybe_whiteout = rdd->first_maybe_whiteout;
Amir Goldstein March 19, 2021, 10:35 a.m. UTC | #5
On Fri, Mar 19, 2021 at 10:36 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Mar 19, 2021 at 6:52 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > [adding overlayfs list]
> >
> > On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
> > <harshadshirwadkar@gmail.com> wrote:
> > >
> > > Thanks for the review Amir.
> > >
> > > Sure changing the subject makes sense.
> > >
> > > Also, on further discussions on Ext4 conference call, we also thought
> > > that with this patch, overlayfs customers would not benefit from fast
> > > commits much if they call renames often. So, in order to really make
> > > rename whiteout a fast commit compatible operation, we probably would
> > > need to add support in fast commit to replay a char device creation
> > > event (since whiteout object is a char device). That would imply, we
> > > would need to do careful versioning and would need to burn an on-disk
> > > feature flag.
> > >
> > > An alternative to this would be to have a static whiteout object with
> > > irrelevant nlink count and to have every rename point to that object
> > > instead. Based on how we decide to implement that, at max only the
> > > first rename operation would be fast commit incompatible since that's
> > > when this object would get created. All the further operations would
> > > be fast commit compatible. The big benefit of this approach is that
> > > this way we don't have to add support for char device creation in fast
> > > commit replay code and thus we don't have to worry about versioning.
> > >
> >
> > I'm glad to hear that, Harshad.
> >
> > Please note that creating a static whiteout object on-disk is one possible
> > implementation option. Not creating any object on-disk may be even better.
>
> I don't really get it.  What's the advantage of not having an object?
>
> Readdir returning DT_WHT internally might be nice, but I'd be careful
> with exporting that to userspace, since it's likely to cause more
> problems that it solves.   And on the stat(2) interface adding S_IFWHT
> or even worse: ENOENT are really out of the question due to backward
> incompatibility with almost every application.
>
> > One other challenge is how to handle users trying to make operations
> > on the upper layer directly (migrating images etc).
> > As long as the tools still observe the whiteout as a chadev (with stat(2))
> > then export and import should work fine (creating a real chardev on import).
>
> Right.
>
> Can't mkfs.ext4 just create the static object?  That sounds to me like
> the simplest approach.
>

Sure. I think that was the original intention and it is probably the easier way.

One thing that we will probably need to do is use the RENAME_WHITEOUT
interface as the explicit way to create the shared whiteout instead of using
vfs_whiteout() for filesystems that support RENAME_WHITEOUT
(we check for RENAME_WHITEOUT support anyway).

The only thing that bothered me in moving from per-ovl-instance singleton
to per-ext4-singleton is what happens if someone tries to (say) chown -R
the upper layer or some other offline modification that was working up to
now and seemed to make sense.

Surely, the ext4 singleton whiteout cannot allow modifications like that,
so what do we do about this? Let those scripts fail (if they exist) and
let their owners fix them to skip errors on whiteouts?

Thanks,
Amir.
Miklos Szeredi March 19, 2021, 10:51 a.m. UTC | #6
On Fri, Mar 19, 2021 at 11:35 AM Amir Goldstein <amir73il@gmail.com> wrote:

> One thing that we will probably need to do is use the RENAME_WHITEOUT
> interface as the explicit way to create the shared whiteout instead of using
> vfs_whiteout() for filesystems that support RENAME_WHITEOUT
> (we check for RENAME_WHITEOUT support anyway).
>
> The only thing that bothered me in moving from per-ovl-instance singleton
> to per-ext4-singleton is what happens if someone tries to (say) chown -R
> the upper layer or some other offline modification that was working up to
> now and seemed to make sense.

Eek.

>
> Surely, the ext4 singleton whiteout cannot allow modifications like that,
> so what do we do about this? Let those scripts fail (if they exist) and
> let their owners fix them to skip errors on whiteouts?

Might try that.  But the no-regressions rule means we'd have to change
that in case it breaks something.

Thanks,
Miklos

> Thanks,
> Amir.
Theodore Ts'o March 21, 2021, 4:39 a.m. UTC | #7
Thanks, applied with Amir's suggested commit summary.

						- Ted
diff mbox series

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f6a36a0e07c1..d7039e86d705 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2807,6 +2807,8 @@  void __ext4_fc_track_link(handle_t *handle, struct inode *inode,
 	struct dentry *dentry);
 void ext4_fc_track_unlink(handle_t *handle, struct dentry *dentry);
 void ext4_fc_track_link(handle_t *handle, struct dentry *dentry);
+void __ext4_fc_track_create(handle_t *handle, struct inode *inode,
+			    struct dentry *dentry);
 void ext4_fc_track_create(handle_t *handle, struct dentry *dentry);
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode);
 void ext4_fc_mark_ineligible(struct super_block *sb, int reason);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 619412134bbf..d5c4aaa69665 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -513,10 +513,10 @@  void ext4_fc_track_link(handle_t *handle, struct dentry *dentry)
 	__ext4_fc_track_link(handle, d_inode(dentry), dentry);
 }
 
-void ext4_fc_track_create(handle_t *handle, struct dentry *dentry)
+void __ext4_fc_track_create(handle_t *handle, struct inode *inode,
+			  struct dentry *dentry)
 {
 	struct __track_dentry_update_args args;
-	struct inode *inode = d_inode(dentry);
 	int ret;
 
 	args.dentry = dentry;
@@ -527,6 +527,11 @@  void ext4_fc_track_create(handle_t *handle, struct dentry *dentry)
 	trace_ext4_fc_track_create(inode, dentry, ret);
 }
 
+void ext4_fc_track_create(handle_t *handle, struct dentry *dentry)
+{
+	__ext4_fc_track_create(handle, d_inode(dentry), dentry);
+}
+
 /* __track_fn for inode tracking */
 static int __track_inode(struct inode *inode, void *arg, bool update)
 {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 115762180801..38176c36dda5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3846,6 +3846,7 @@  static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 		retval = ext4_mark_inode_dirty(handle, whiteout);
 		if (unlikely(retval))
 			goto end_rename;
+
 	}
 	if (!new.bh) {
 		retval = ext4_add_entry(handle, new.dentry, old.inode);
@@ -3919,6 +3920,8 @@  static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 			ext4_fc_track_unlink(handle, new.dentry);
 		__ext4_fc_track_link(handle, old.inode, new.dentry);
 		__ext4_fc_track_unlink(handle, old.inode, old.dentry);
+		if (whiteout)
+			__ext4_fc_track_create(handle, whiteout, old.dentry);
 	}
 
 	if (new.inode) {