Message ID | 20190906205241.2292-1-agruenba@redhat.com |
---|---|
State | New |
Headers | show |
Series | [Q] gfs2: mmap write vs. punch_hole consistency | expand |
On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > Hi, > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > filesystems with a block size smaller that the page size [1]. > > It turns out that the same problem exists between mmap write and hole > punching, and since xfstests doesn't seem to cover that, AFAIA, fsx exercises it pretty often. Certainly it's found problems with XFS in the past w.r.t. these operations. > I've written a > new test [2]. I suspect that what we really want is a test that runs _test_generic_punch using mmap rather than pwrite... > Ext4 and xfs both pass that test; they both apparently > mark the pages that have a hole punched in them as read-only so that > page_mkwrite is called before those pages can be written to again. XFS invalidates the range being hole punched (see xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any attempt to fault that page back in will block on the MMAPLOCK until the hole punch finishes. > gfs2 fails that: for some reason, the partially block-mapped pages are > not marked read-only on gfs2, and so page_mkwrite is not called for the > partially block-mapped pages, and the hole is not filled in correctly. > > The attached patch fixes the problem, but this really doesn't look right > as neither ext4 nor xfs require this kind of hack. So what am I > overlooking, how does this work on ext4 and xfs? XFS uses XFS_MMAPLOCK_* to serialise page faults against extent manipulations (shift, hole punch, truncate, swap, etc) and ext4 uses a similar locking mechanism to do the same thing. Fundamentally, the page cache does not provide the necessary mechanisms to detect and prevent invalidation races inside EOF.... > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> > --- > fs/gfs2/bmap.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c > index 9ef543dd38e2..e677e813be4c 100644 > --- a/fs/gfs2/bmap.c > +++ b/fs/gfs2/bmap.c > @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length) > if (error) > goto out; > } > + /* > + * If the first or last page partially lies in the hole, mark > + * the page read-only so that memory-mapped writes will trigger > + * page_mkwrite. > + */ > + pagecache_isize_extended(inode, offset, inode->i_size); > + pagecache_isize_extended(inode, offset + length, inode->i_size); See xfs_flush_unmap_range(), which is run under XFS_MMAPLOCK_EXCL to serialise against incoming page faults... Cheers, Dave.
On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@fromorbit.com> wrote: > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > > Hi, > > > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > > filesystems with a block size smaller that the page size [1]. > > > > It turns out that the same problem exists between mmap write and hole > > punching, and since xfstests doesn't seem to cover that, > > AFAIA, fsx exercises it pretty often. Certainly it's found problems > with XFS in the past w.r.t. these operations. > > > I've written a > > new test [2]. > > I suspect that what we really want is a test that runs > _test_generic_punch using mmap rather than pwrite... > > > Ext4 and xfs both pass that test; they both apparently > > mark the pages that have a hole punched in them as read-only so that > > page_mkwrite is called before those pages can be written to again. > > XFS invalidates the range being hole punched (see > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any > attempt to fault that page back in will block on the MMAPLOCK until > the hole punch finishes. This isn't about writes during the hole punching, this is about writes once the hole is punched. For example, the test case I've posted creates the following file layout with 1k blocksize: DDDD DDDD DDDD Then it punches a hole like this: DDHH HHHH HHDD Then it fills the hole again with mwrite: DDDD DDDD DDDD As far as I can tell, that needs to trigger page faults on all three pages. I did get these on ext4; judging from the fact that xfs works, the also seem to occur there; but on gfs2, page_mkwrite isn't called for the two partially mapped pages, only for the page in the middle that's entirely within the hole. And I don't see where those pages are marked read-only; it appears like pagecache_isize_extended isn't called on ext4 or xfs. So how does this work there? > > gfs2 fails that: for some reason, the partially block-mapped pages are > > not marked read-only on gfs2, and so page_mkwrite is not called for the > > partially block-mapped pages, and the hole is not filled in correctly. > > > > The attached patch fixes the problem, but this really doesn't look right > > as neither ext4 nor xfs require this kind of hack. So what am I > > overlooking, how does this work on ext4 and xfs? > > XFS uses XFS_MMAPLOCK_* to serialise page faults against extent > manipulations (shift, hole punch, truncate, swap, etc) and ext4 uses > a similar locking mechanism to do the same thing. Fundamentally, the > page cache does not provide the necessary mechanisms to detect and > prevent invalidation races inside EOF.... Yes, but that unfortunately doesn't answer my question. Thanks, Andreas > > > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> > > --- > > fs/gfs2/bmap.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c > > index 9ef543dd38e2..e677e813be4c 100644 > > --- a/fs/gfs2/bmap.c > > +++ b/fs/gfs2/bmap.c > > @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length) > > if (error) > > goto out; > > } > > + /* > > + * If the first or last page partially lies in the hole, mark > > + * the page read-only so that memory-mapped writes will trigger > > + * page_mkwrite. > > + */ > > + pagecache_isize_extended(inode, offset, inode->i_size); > > + pagecache_isize_extended(inode, offset + length, inode->i_size); > > See xfs_flush_unmap_range(), which is run under XFS_MMAPLOCK_EXCL > to serialise against incoming page faults... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Sep 06, 2019 at 11:48:31PM +0200, Andreas Gruenbacher wrote: > On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > > > Hi, > > > > > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > > > filesystems with a block size smaller that the page size [1]. > > > > > > It turns out that the same problem exists between mmap write and hole > > > punching, and since xfstests doesn't seem to cover that, > > > > AFAIA, fsx exercises it pretty often. Certainly it's found problems > > with XFS in the past w.r.t. these operations. > > > > > I've written a > > > new test [2]. > > > > I suspect that what we really want is a test that runs > > _test_generic_punch using mmap rather than pwrite... > > > > > Ext4 and xfs both pass that test; they both apparently > > > mark the pages that have a hole punched in them as read-only so that > > > page_mkwrite is called before those pages can be written to again. > > > > XFS invalidates the range being hole punched (see > > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any > > attempt to fault that page back in will block on the MMAPLOCK until > > the hole punch finishes. > > This isn't about writes during the hole punching, this is about writes > once the hole is punched. How do you prevent this: hole punch process: other process clean page invalidate page write page fault instantiate new page page_mkwrite page dirty do hole punch overwrite hole Firstly, you end up with a dirty page over a hole with no backing store, and second, you get no page fault on overwrite because the pages are already dirty. That's the problem the MMAPLOCK in XFS solves - it's integral to ensuring the page faults on mmap write are correctly serialised with both the page cache invalidation and the underlying extent manipulation that the hole punch operation then does. i.e. if you want a page fault /after/ a hole punch, you have to prevent it occurring during the hole punch after the page has already been marked clean and/or invalidated. > For example, the test case I've posted > creates the following file layout with 1k blocksize: > > DDDD DDDD DDDD > > Then it punches a hole like this: > > DDHH HHHH HHDD > > Then it fills the hole again with mwrite: > > DDDD DDDD DDDD > > As far as I can tell, that needs to trigger page faults on all three > pages. Yes, but only /after/ the hole has been punched. > I did get these on ext4; judging from the fact that xfs works, > the also seem to occur there; but on gfs2, page_mkwrite isn't called > for the two partially mapped pages, only for the page in the middle > that's entirely within the hole. And I don't see where those pages are > marked read-only; it appears like pagecache_isize_extended isn't > called on ext4 or xfs. So how does this work there? As I said in my last response, the answer is in xfs_flush_unmap_range(). That uses truncate_pagecache_range() to do the necessary page cache manipulation. Cheers, Dave.
On Fri 06-09-19 23:48:31, Andreas Gruenbacher wrote: > On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > > > Hi, > > > > > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > > > filesystems with a block size smaller that the page size [1]. > > > > > > It turns out that the same problem exists between mmap write and hole > > > punching, and since xfstests doesn't seem to cover that, > > > > AFAIA, fsx exercises it pretty often. Certainly it's found problems > > with XFS in the past w.r.t. these operations. > > > > > I've written a > > > new test [2]. > > > > I suspect that what we really want is a test that runs > > _test_generic_punch using mmap rather than pwrite... > > > > > Ext4 and xfs both pass that test; they both apparently > > > mark the pages that have a hole punched in them as read-only so that > > > page_mkwrite is called before those pages can be written to again. > > > > XFS invalidates the range being hole punched (see > > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any > > attempt to fault that page back in will block on the MMAPLOCK until > > the hole punch finishes. > > This isn't about writes during the hole punching, this is about writes > once the hole is punched. For example, the test case I've posted > creates the following file layout with 1k blocksize: > > DDDD DDDD DDDD > > Then it punches a hole like this: > > DDHH HHHH HHDD > > Then it fills the hole again with mwrite: > > DDDD DDDD DDDD > > As far as I can tell, that needs to trigger page faults on all three > pages. I did get these on ext4; judging from the fact that xfs works, > the also seem to occur there; but on gfs2, page_mkwrite isn't called > for the two partially mapped pages, only for the page in the middle > that's entirely within the hole. And I don't see where those pages are > marked read-only; it appears like pagecache_isize_extended isn't > called on ext4 or xfs. So how does this work there? The trick ext4 & xfs use is that they writeout the range being punched first (see e.g. ext4_punch_hole() calling filemap_write_and_wait_range() or xfs_flush_unmap_range() called from xfs_free_file_space()). This writeout also has the effect that all the page mappings for that range get write-protected. Another related issue is what Dave points out: Even if you use writeout to writeprotect pages, GFS2 still seems to have a race where page fault can come while you are freeing blocks and if you allow that you usually get into a problematic state. Effects depend on fs implementation details but usually it can result in stale data exposure or fs corruption. Honza
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c index 9ef543dd38e2..e677e813be4c 100644 --- a/fs/gfs2/bmap.c +++ b/fs/gfs2/bmap.c @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length) if (error) goto out; } + /* + * If the first or last page partially lies in the hole, mark + * the page read-only so that memory-mapped writes will trigger + * page_mkwrite. + */ + pagecache_isize_extended(inode, offset, inode->i_size); + pagecache_isize_extended(inode, offset + length, inode->i_size); } if (gfs2_is_jdata(ip)) {
Hi, I've just fixed a mmap write vs. truncate consistency issue on gfs on filesystems with a block size smaller that the page size [1]. It turns out that the same problem exists between mmap write and hole punching, and since xfstests doesn't seem to cover that, I've written a new test [2]. Ext4 and xfs both pass that test; they both apparently mark the pages that have a hole punched in them as read-only so that page_mkwrite is called before those pages can be written to again. gfs2 fails that: for some reason, the partially block-mapped pages are not marked read-only on gfs2, and so page_mkwrite is not called for the partially block-mapped pages, and the hole is not filled in correctly. The attached patch fixes the problem, but this really doesn't look right as neither ext4 nor xfs require this kind of hack. So what am I overlooking, how does this work on ext4 and xfs? Thanks, Andreas [1] gfs2: Improve mmap write vs. truncate consistency https://www.redhat.com/archives/cluster-devel/2019-September/msg00009.html [2] generic/567: test mmap write vs. hole punching https://www.spinics.net/lists/fstests/msg12474.html [PATCH] gfs2: Improve mmap write vs. punch_hole consistency Fixes xfstest generic/567. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> --- fs/gfs2/bmap.c | 7 +++++++ 1 file changed, 7 insertions(+)