Message ID | ec267e95fd21891986373c7af1c72b4c8b507332.1376679411.git.luto@amacapital.net |
---|---|
State | Superseded, archived |
Headers | show |
On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote: > Filesystems that defer cmtime updates should update cmtime when any > of these events happen after a write via a mapping: > > - The mapping is written back to disk. This happens from all kinds > of places, all of which eventually call ->writepages. > > - munmap is called or the mapping is removed when the process exits > > - msync(MS_ASYNC) is called. Linux currently does nothing for > msync(MS_ASYNC), but POSIX says that cmtime should be updated some > time between an mmaped write and the subsequent msync call. > MS_SYNC calls ->writepages, but MS_ASYNC needs special handling. > > Filesystmes that defer cmtime updates should flush them on munmap or > exit. Finding out that this happened through vm_ops is messy, so > add a new address space op for this. > > It's not strictly necessary to call ->flush_cmtime after ->writepages, > but it simplifies the fs code. As an optional optimization, > filesystems can call mapping_test_clear_cmtime themselves in > ->writepages (as long as they're careful to scan all the pages first > -- the cmtime bit may not be set when ->writepages is entered). .flush_cmtime is effectively a duplicate method. We already have .update_time to notify filesystems that they need to update the timestamp in the inode transactionally. Indeed: > + /* > + * Userspace expects certain system calls to update cmtime if > + * a file has been recently written using a shared vma. In > + * cases where cmtime may need to be updated but writepages is > + * not called, this is called instead. (Implementations > + * should call mapping_test_clear_cmtime.) > + */ > + void (*flush_cmtime)(struct address_space *); You say it can be implemented in the ->writepage(s) method, and all filesystems provide ->writepage(s) in some form. Therefore I would have thought it be best to simply require filesystems to check that mapping flag during those methods and update the inode directly when that is set? Indeed, the way you've set up the infrastructure, we'll have to rewrite the cmtime update code to enable writepages to update this within some other transaction. Perhaps you should just implement it that way first? > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) > ret = mapping->a_ops->writepages(mapping, wbc); > else > ret = generic_writepages(mapping, wbc); > + > + /* > + * ->writepages will call clear_page_dirty_for_io, which may, in turn, > + * mark the mapping for deferred cmtime update. As an optimization, > + * a filesystem can flush the update at the end of ->writepages > + * (possibly avoiding a journal transaction, for example), but, > + * for simplicity, let filesystems skip that part and just implement > + * ->flush_cmtime. > + */ > + if (mapping->a_ops->flush_cmtime) > + mapping->a_ops->flush_cmtime(mapping); And that's where you cannot call sb_pagefault_start/end.... Cheers, Dave.
On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote: >> Filesystems that defer cmtime updates should update cmtime when any >> of these events happen after a write via a mapping: >> >> - The mapping is written back to disk. This happens from all kinds >> of places, all of which eventually call ->writepages. >> >> - munmap is called or the mapping is removed when the process exits >> >> - msync(MS_ASYNC) is called. Linux currently does nothing for >> msync(MS_ASYNC), but POSIX says that cmtime should be updated some >> time between an mmaped write and the subsequent msync call. >> MS_SYNC calls ->writepages, but MS_ASYNC needs special handling. >> >> Filesystmes that defer cmtime updates should flush them on munmap or >> exit. Finding out that this happened through vm_ops is messy, so >> add a new address space op for this. >> >> It's not strictly necessary to call ->flush_cmtime after ->writepages, >> but it simplifies the fs code. As an optional optimization, >> filesystems can call mapping_test_clear_cmtime themselves in >> ->writepages (as long as they're careful to scan all the pages first >> -- the cmtime bit may not be set when ->writepages is entered). > > .flush_cmtime is effectively a duplicate method. We already have > .update_time to notify filesystems that they need to update the > timestamp in the inode transactionally. .update_time is used for the atime update as well, and it relies on the core code to update the in-memory timestamp first. I used that approach in v2, but it was (correctly, I think) pointed out that this was a layering violation and that core code shouldn't be mucking with the timestamps directly during writeback. There was a recent effort to move most of the file_update_calls from the core into .page_mkwrite, and I don't think anyone wants to undo that. > > Indeed: > >> + /* >> + * Userspace expects certain system calls to update cmtime if >> + * a file has been recently written using a shared vma. In >> + * cases where cmtime may need to be updated but writepages is >> + * not called, this is called instead. (Implementations >> + * should call mapping_test_clear_cmtime.) >> + */ >> + void (*flush_cmtime)(struct address_space *); > > You say it can be implemented in the ->writepage(s) method, and all > filesystems provide ->writepage(s) in some form. Therefore I would > have thought it be best to simply require filesystems to check that > mapping flag during those methods and update the inode directly when > that is set? The problem with only doing it in ->writepages is that calling writepages from munmap and exit would probably hurt performance for no particular gain. So I need some kind of callback to say "update the time, but don't write data." The AS_CMTIME bit will still be set when the ptes are removed. I could require ->writepages *and* ->flush_cmtime to handle the time update, but that would complicate non-transactional filesystems. Those filesystems should just flush cmtime at the end of writepages. > > Indeed, the way you've set up the infrastructure, we'll have to > rewrite the cmtime update code to enable writepages to update this > within some other transaction. Perhaps you should just implement it > that way first? This is already possible although not IMO necessary for correctness. All that ext4 would need to do is to add something like: if (mapping_test_clear_cmtime(mapping)) { update times within current transaction } somewhere inside the transaction in writepages. There would probably be room for some kind of generic helper to do everything in inode_update_time_writable except for the actual mark_inode_dirty part, but this still seems nasty from a locking perspective, and I'd rather leave that optimization to an ext4 developer who wants to do it. I could simplify this a bit by moving the mapping_test_clear_cmtime part from .flush_cmtime to its callers. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote: > On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote: > >> Filesystems that defer cmtime updates should update cmtime when any > >> of these events happen after a write via a mapping: > >> > >> - The mapping is written back to disk. This happens from all kinds > >> of places, all of which eventually call ->writepages. > >> > >> - munmap is called or the mapping is removed when the process exits > >> > >> - msync(MS_ASYNC) is called. Linux currently does nothing for > >> msync(MS_ASYNC), but POSIX says that cmtime should be updated some > >> time between an mmaped write and the subsequent msync call. > >> MS_SYNC calls ->writepages, but MS_ASYNC needs special handling. > >> > >> Filesystmes that defer cmtime updates should flush them on munmap or > >> exit. Finding out that this happened through vm_ops is messy, so > >> add a new address space op for this. > >> > >> It's not strictly necessary to call ->flush_cmtime after ->writepages, > >> but it simplifies the fs code. As an optional optimization, > >> filesystems can call mapping_test_clear_cmtime themselves in > >> ->writepages (as long as they're careful to scan all the pages first > >> -- the cmtime bit may not be set when ->writepages is entered). > > > > .flush_cmtime is effectively a duplicate method. We already have > > .update_time to notify filesystems that they need to update the > > timestamp in the inode transactionally. > > .update_time is used for the atime update as well, and it relies on > the core code to update the in-memory timestamp first. No, it doesn't. The caller of .update_time provides the timestamp to that gets written into the relevant fields of the inode. i.e. this code: now = current_fs_time(inode->i_sb); if (inode->i_op->update_time) return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME); will update *only* the ctime and mtime of the inode to have a value of @now. That's precisely what you want to do, yes? Indeed, your .flush_cmtime function effectively does: flags = prepare_cmtime(inode, &now) { *now = current_fs_time() flags = S_CTIME|S_MTIME; } update_time(inode, now, flags); { inode->i_op->update_time(inode, now, flags) or mark_inode_dirty_sync() } IOWs, you're adding a filesystem specific method to update a timestamp that wraps around a generic method for updating a timestamp. i.e. .flush_cmtime is not necessary - you could just call inode_update_time_writable() from generic_writepages() and it would simply just work for all filesystems.... > I used that > approach in v2, but it was (correctly, I think) pointed out that this > was a layering violation and that core code shouldn't be mucking with > the timestamps directly during writeback. If calling a generic method to update the timestamp is a layering violation, then why is replacing that with a filesystem specific method of updating a timestamp not a layering violation? > > Indeed: > > > >> + /* > >> + * Userspace expects certain system calls to update cmtime if > >> + * a file has been recently written using a shared vma. In > >> + * cases where cmtime may need to be updated but writepages is > >> + * not called, this is called instead. (Implementations > >> + * should call mapping_test_clear_cmtime.) > >> + */ > >> + void (*flush_cmtime)(struct address_space *); > > > > You say it can be implemented in the ->writepage(s) method, and all > > filesystems provide ->writepage(s) in some form. Therefore I would > > have thought it be best to simply require filesystems to check that > > mapping flag during those methods and update the inode directly when > > that is set? > > The problem with only doing it in ->writepages is that calling > writepages from munmap and exit would probably hurt performance for no > particular gain. So I need some kind of callback to say "update the > time, but don't write data." The AS_CMTIME bit will still be set when > the ptes are removed. What's the point of updating the timestamp at unmap when it's going to be changed again when the writeback occurs? > I could require ->writepages *and* ->flush_cmtime to handle the time > update, but that would complicate non-transactional filesystems. > Those filesystems should just flush cmtime at the end of writepages. do_writepages() is the wrong place to do such updates - we can get writeback directly through .writepage, so the time updates need to be in .writepage. That first .writepage call will clear the bit on the mapping, so it's only done on the first call to .writepage on the given mapping. And some filesystems provide a .writepages method that doesn't call .writepage, so those filesystems will need to do the timestamp update in those methods. > > Indeed, the way you've set up the infrastructure, we'll have to > > rewrite the cmtime update code to enable writepages to update this > > within some other transaction. Perhaps you should just implement it > > that way first? > > This is already possible although not IMO necessary for correctness. > All that ext4 would need to do is to add something like: > > if (mapping_test_clear_cmtime(mapping)) { > update times within current transaction > } Where does it get the timestamps from? i.e. the update could call prepare_update_cmtime() to do this, yes? And so having a wrapper that does if (mapping_test_clear_cmtime(mapping)) return prepare_update_cmtime(inode); return 0; would work just fine for them, yes? > somewhere inside the transaction in writepages. There would probably > be room for some kind of generic helper to do everything in > inode_update_time_writable except for the actual mark_inode_dirty > part, but this still seems nasty from a locking perspective, and I'd > rather leave that optimization to an ext4 developer who wants to do > it. filesystems that implement .update_time don't need to mark the struct inode dirty on timestamp updates. Similarly, filesystems that use a writepages transaction to do the update don't either.... Cheers, Dave.
On Mon, Aug 19, 2013 at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote: >> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote: >> >> Filesystems that defer cmtime updates should update cmtime when any >> >> of these events happen after a write via a mapping: >> >> >> >> - The mapping is written back to disk. This happens from all kinds >> >> of places, all of which eventually call ->writepages. >> >> >> >> - munmap is called or the mapping is removed when the process exits >> >> >> >> - msync(MS_ASYNC) is called. Linux currently does nothing for >> >> msync(MS_ASYNC), but POSIX says that cmtime should be updated some >> >> time between an mmaped write and the subsequent msync call. >> >> MS_SYNC calls ->writepages, but MS_ASYNC needs special handling. >> >> >> >> Filesystmes that defer cmtime updates should flush them on munmap or >> >> exit. Finding out that this happened through vm_ops is messy, so >> >> add a new address space op for this. >> >> >> >> It's not strictly necessary to call ->flush_cmtime after ->writepages, >> >> but it simplifies the fs code. As an optional optimization, >> >> filesystems can call mapping_test_clear_cmtime themselves in >> >> ->writepages (as long as they're careful to scan all the pages first >> >> -- the cmtime bit may not be set when ->writepages is entered). >> > >> > .flush_cmtime is effectively a duplicate method. We already have >> > .update_time to notify filesystems that they need to update the >> > timestamp in the inode transactionally. >> >> .update_time is used for the atime update as well, and it relies on >> the core code to update the in-memory timestamp first. > > No, it doesn't. The caller of .update_time provides the timestamp to > that gets written into the relevant fields of the inode. > > i.e. this code: > > now = current_fs_time(inode->i_sb); > if (inode->i_op->update_time) > return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME); > > will update *only* the ctime and mtime of the inode to have a value > of @now. That's precisely what you want to do, yes? > > Indeed, your .flush_cmtime function effectively does: > > flags = prepare_cmtime(inode, &now) > { *now = current_fs_time() > flags = S_CTIME|S_MTIME; > } > update_time(inode, now, flags); > { > inode->i_op->update_time(inode, now, flags) > or > mark_inode_dirty_sync() > } > > IOWs, you're adding a filesystem specific method to update a > timestamp that wraps around a generic method for updating a > timestamp. i.e. .flush_cmtime is not necessary - you could just > call inode_update_time_writable() from generic_writepages() and it > would simply just work for all filesystems.... OK, I'll try that out. > >> I used that >> approach in v2, but it was (correctly, I think) pointed out that this >> was a layering violation and that core code shouldn't be mucking with >> the timestamps directly during writeback. > > If calling a generic method to update the timestamp is a layering > violation, then why is replacing that with a filesystem specific > method of updating a timestamp not a layering violation? > >> > Indeed: >> > >> >> + /* >> >> + * Userspace expects certain system calls to update cmtime if >> >> + * a file has been recently written using a shared vma. In >> >> + * cases where cmtime may need to be updated but writepages is >> >> + * not called, this is called instead. (Implementations >> >> + * should call mapping_test_clear_cmtime.) >> >> + */ >> >> + void (*flush_cmtime)(struct address_space *); >> > >> > You say it can be implemented in the ->writepage(s) method, and all >> > filesystems provide ->writepage(s) in some form. Therefore I would >> > have thought it be best to simply require filesystems to check that >> > mapping flag during those methods and update the inode directly when >> > that is set? >> >> The problem with only doing it in ->writepages is that calling >> writepages from munmap and exit would probably hurt performance for no >> particular gain. So I need some kind of callback to say "update the >> time, but don't write data." The AS_CMTIME bit will still be set when >> the ptes are removed. > > What's the point of updating the timestamp at unmap when it's going > to be changed again when the writeback occurs? To avoid breaking make and similar tools. Suppose that an output file already exists but is stale. Some program gets called to update it, and it opens it for write, mmaps it, updates it, calls munmap, and exits. Its parent will expect the timestamp to have been updated, but writeback may not have happened. > >> I could require ->writepages *and* ->flush_cmtime to handle the time >> update, but that would complicate non-transactional filesystems. >> Those filesystems should just flush cmtime at the end of writepages. > > do_writepages() is the wrong place to do such updates - we can get > writeback directly through .writepage, so the time updates need to > be in .writepage. That first .writepage call will clear the bit on > the mapping, so it's only done on the first call to .writepage on > the given mapping. Last time I checked, all the paths that actually needed the timestamp update went through .writepages. I'll double-check. > > And some filesystems provide a .writepages method that doesn't call > .writepage, so those filesystems will need to do the timestamp > update in those methods. > >> > Indeed, the way you've set up the infrastructure, we'll have to >> > rewrite the cmtime update code to enable writepages to update this >> > within some other transaction. Perhaps you should just implement it >> > that way first? >> >> This is already possible although not IMO necessary for correctness. >> All that ext4 would need to do is to add something like: >> >> if (mapping_test_clear_cmtime(mapping)) { >> update times within current transaction >> } > > Where does it get the timestamps from? i.e. the update could call > prepare_update_cmtime() to do this, yes? And so having a wrapper > that does > > if (mapping_test_clear_cmtime(mapping)) > return prepare_update_cmtime(inode); > return 0; > > would work just fine for them, yes? > >> somewhere inside the transaction in writepages. There would probably >> be room for some kind of generic helper to do everything in >> inode_update_time_writable except for the actual mark_inode_dirty >> part, but this still seems nasty from a locking perspective, and I'd >> rather leave that optimization to an ext4 developer who wants to do >> it. > > filesystems that implement .update_time don't need to mark the > struct inode dirty on timestamp updates. Similarly, filesystems that > use a writepages transaction to do the update don't either.... Fair enough. I'll spin a v4 and see how it looks. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: > >> I could require ->writepages *and* ->flush_cmtime to handle the time > >> update, but that would complicate non-transactional filesystems. > >> Those filesystems should just flush cmtime at the end of writepages. > > > > do_writepages() is the wrong place to do such updates - we can get > > writeback directly through .writepage, so the time updates need to > > be in .writepage. That first .writepage call will clear the bit on > > the mapping, so it's only done on the first call to .writepage on > > the given mapping. > > Last time I checked, all the paths that actually needed the timestamp > update went through .writepages. I'll double-check. kswapd can call just .writepage to do the writeout so timestamp update should be handled there as well. Otherwise all pages in a mapping can be cleaned without timestamp being updated. Which btw made me realize that even your scheme doesn't completely make sure timestamp is updated after mmap write - if you have pages 0 and 1, you write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) is called. We write the page 0, writeprotect it, update timestamps. But page 1 is still writeable so writes to it won't set CMTIME flag, neither update the timestamp... Not that I think this can be reasonably solved but it is a food for thought. Honza
On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: >> >> I could require ->writepages *and* ->flush_cmtime to handle the time >> >> update, but that would complicate non-transactional filesystems. >> >> Those filesystems should just flush cmtime at the end of writepages. >> > >> > do_writepages() is the wrong place to do such updates - we can get >> > writeback directly through .writepage, so the time updates need to >> > be in .writepage. That first .writepage call will clear the bit on >> > the mapping, so it's only done on the first call to .writepage on >> > the given mapping. >> >> Last time I checked, all the paths that actually needed the timestamp >> update went through .writepages. I'll double-check. > kswapd can call just .writepage to do the writeout so timestamp update > should be handled there as well. Otherwise all pages in a mapping can be > cleaned without timestamp being updated. OK, I'll fix that. > > Which btw made me realize that even your scheme doesn't completely make > sure timestamp is updated after mmap write - if you have pages 0 and 1, you > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) > is called. We write the page 0, writeprotect it, update timestamps. But > page 1 is still writeable so writes to it won't set CMTIME flag, neither > update the timestamp... Not that I think this can be reasonably solved but > it is a food for thought. This should already work. AS_CMTIME is set when the pte goes from dirty to clean, not when the pte goes from wp to writable. So whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will be set and a subsequent writepages call will update the timestamp. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Aug 20, 2013 at 9:42 AM, Andy Lutomirski <luto@amacapital.net> wrote: > On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: >> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: >>> >> I could require ->writepages *and* ->flush_cmtime to handle the time >>> >> update, but that would complicate non-transactional filesystems. >>> >> Those filesystems should just flush cmtime at the end of writepages. >>> > >>> > do_writepages() is the wrong place to do such updates - we can get >>> > writeback directly through .writepage, so the time updates need to >>> > be in .writepage. That first .writepage call will clear the bit on >>> > the mapping, so it's only done on the first call to .writepage on >>> > the given mapping. >>> >>> Last time I checked, all the paths that actually needed the timestamp >>> update went through .writepages. I'll double-check. >> kswapd can call just .writepage to do the writeout so timestamp update >> should be handled there as well. Otherwise all pages in a mapping can be >> cleaned without timestamp being updated. > > OK, I'll fix that. This is a bit ugly. mpage_writepages and generic_writepages both call ->writepage. If writepage starts checking cmtime, then writepages will do multiple timestamp updates on filesystems that use this code. I can modify vmscan and migrate to flush cmtime -- they seem to be the only callers of ->writepage that aren't themselves called from ->writepages. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote: > On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: > > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: > >> >> I could require ->writepages *and* ->flush_cmtime to handle the time > >> >> update, but that would complicate non-transactional filesystems. > >> >> Those filesystems should just flush cmtime at the end of writepages. > >> > > >> > do_writepages() is the wrong place to do such updates - we can get > >> > writeback directly through .writepage, so the time updates need to > >> > be in .writepage. That first .writepage call will clear the bit on > >> > the mapping, so it's only done on the first call to .writepage on > >> > the given mapping. > >> > >> Last time I checked, all the paths that actually needed the timestamp > >> update went through .writepages. I'll double-check. > > kswapd can call just .writepage to do the writeout so timestamp update > > should be handled there as well. Otherwise all pages in a mapping can be > > cleaned without timestamp being updated. > > OK, I'll fix that. > > > > > Which btw made me realize that even your scheme doesn't completely make > > sure timestamp is updated after mmap write - if you have pages 0 and 1, you > > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) > > is called. We write the page 0, writeprotect it, update timestamps. But > > page 1 is still writeable so writes to it won't set CMTIME flag, neither > > update the timestamp... Not that I think this can be reasonably solved but > > it is a food for thought. > > This should already work. AS_CMTIME is set when the pte goes from > dirty to clean, not when the pte goes from wp to writable. So > whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will > be set and a subsequent writepages call will update the timestamp. Oh, I missed that - I thought you were setting AS_CMTIME during .page_mkwrite. Setting it in clear_page_dirty_for_io() is too late for filesystems to include it in their existing transactions during .writepage, (at least for XFs and ext4) because they do their delayed allocation transactions before changing page state.... Cheers, Dave.
On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote: >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time >> >> >> update, but that would complicate non-transactional filesystems. >> >> >> Those filesystems should just flush cmtime at the end of writepages. >> >> > >> >> > do_writepages() is the wrong place to do such updates - we can get >> >> > writeback directly through .writepage, so the time updates need to >> >> > be in .writepage. That first .writepage call will clear the bit on >> >> > the mapping, so it's only done on the first call to .writepage on >> >> > the given mapping. >> >> >> >> Last time I checked, all the paths that actually needed the timestamp >> >> update went through .writepages. I'll double-check. >> > kswapd can call just .writepage to do the writeout so timestamp update >> > should be handled there as well. Otherwise all pages in a mapping can be >> > cleaned without timestamp being updated. >> >> OK, I'll fix that. >> >> > >> > Which btw made me realize that even your scheme doesn't completely make >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) >> > is called. We write the page 0, writeprotect it, update timestamps. But >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither >> > update the timestamp... Not that I think this can be reasonably solved but >> > it is a food for thought. >> >> This should already work. AS_CMTIME is set when the pte goes from >> dirty to clean, not when the pte goes from wp to writable. So >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will >> be set and a subsequent writepages call will update the timestamp. > > Oh, I missed that - I thought you were setting AS_CMTIME during > .page_mkwrite. > > Setting it in clear_page_dirty_for_io() is too late for filesystems > to include it in their existing transactions during .writepage, (at > least for XFs and ext4) because they do their delayed allocation > transactions before changing page state.... Couldn't it go between mpage_map_and_submit_extent and ext4_journal_stop in ext4_writepages? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote: > On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote: > >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: > >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: > >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time > >> >> >> update, but that would complicate non-transactional filesystems. > >> >> >> Those filesystems should just flush cmtime at the end of writepages. > >> >> > > >> >> > do_writepages() is the wrong place to do such updates - we can get > >> >> > writeback directly through .writepage, so the time updates need to > >> >> > be in .writepage. That first .writepage call will clear the bit on > >> >> > the mapping, so it's only done on the first call to .writepage on > >> >> > the given mapping. > >> >> > >> >> Last time I checked, all the paths that actually needed the timestamp > >> >> update went through .writepages. I'll double-check. > >> > kswapd can call just .writepage to do the writeout so timestamp update > >> > should be handled there as well. Otherwise all pages in a mapping can be > >> > cleaned without timestamp being updated. > >> > >> OK, I'll fix that. > >> > >> > > >> > Which btw made me realize that even your scheme doesn't completely make > >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you > >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) > >> > is called. We write the page 0, writeprotect it, update timestamps. But > >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither > >> > update the timestamp... Not that I think this can be reasonably solved but > >> > it is a food for thought. > >> > >> This should already work. AS_CMTIME is set when the pte goes from > >> dirty to clean, not when the pte goes from wp to writable. So > >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will > >> be set and a subsequent writepages call will update the timestamp. > > > > Oh, I missed that - I thought you were setting AS_CMTIME during > > .page_mkwrite. > > > > Setting it in clear_page_dirty_for_io() is too late for filesystems > > to include it in their existing transactions during .writepage, (at > > least for XFs and ext4) because they do their delayed allocation > > transactions before changing page state.... > > Couldn't it go between mpage_map_and_submit_extent and > ext4_journal_stop in ext4_writepages? Maybe - I'm not an ext4 expert - but even if you can make it work for ext4 in some way, that doesn't mean it is possible for any other filesystem to use the same method. You're adding code to generic, non-filesystem specific code paths and so the solutions need to be generic rather not tied to how a specific filesystem is implemented. Cheers, Dave.
On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote: >> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote: >> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: >> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: >> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time >> >> >> >> update, but that would complicate non-transactional filesystems. >> >> >> >> Those filesystems should just flush cmtime at the end of writepages. >> >> >> > >> >> >> > do_writepages() is the wrong place to do such updates - we can get >> >> >> > writeback directly through .writepage, so the time updates need to >> >> >> > be in .writepage. That first .writepage call will clear the bit on >> >> >> > the mapping, so it's only done on the first call to .writepage on >> >> >> > the given mapping. >> >> >> >> >> >> Last time I checked, all the paths that actually needed the timestamp >> >> >> update went through .writepages. I'll double-check. >> >> > kswapd can call just .writepage to do the writeout so timestamp update >> >> > should be handled there as well. Otherwise all pages in a mapping can be >> >> > cleaned without timestamp being updated. >> >> >> >> OK, I'll fix that. >> >> >> >> > >> >> > Which btw made me realize that even your scheme doesn't completely make >> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you >> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) >> >> > is called. We write the page 0, writeprotect it, update timestamps. But >> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither >> >> > update the timestamp... Not that I think this can be reasonably solved but >> >> > it is a food for thought. >> >> >> >> This should already work. AS_CMTIME is set when the pte goes from >> >> dirty to clean, not when the pte goes from wp to writable. So >> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will >> >> be set and a subsequent writepages call will update the timestamp. >> > >> > Oh, I missed that - I thought you were setting AS_CMTIME during >> > .page_mkwrite. >> > >> > Setting it in clear_page_dirty_for_io() is too late for filesystems >> > to include it in their existing transactions during .writepage, (at >> > least for XFs and ext4) because they do their delayed allocation >> > transactions before changing page state.... >> >> Couldn't it go between mpage_map_and_submit_extent and >> ext4_journal_stop in ext4_writepages? > > Maybe - I'm not an ext4 expert - but even if you can make it work > for ext4 in some way, that doesn't mean it is possible for any other > filesystem to use the same method. You're adding code to generic, > non-filesystem specific code paths and so the solutions need to be > generic rather not tied to how a specific filesystem is implemented. > I don't see the problem for xfs or btrfs either. xfs uses generic_writepages, which already does the right thing. (xfs with my updated patches passes my tests.) xfs_vm_writepage calls xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is called. At that point (I presume), it would still be possible to add metadata to a transaction (assuming there's a transaction open -- I don't have a clue here). Even if this is too late, xfs_vm_writepage could call page_mkwrite to for AS_CMTIME to be set if needed. page_mkwrite will be fast if the page isn't mmapped. What am I missing? btrfs seems to do much the same thing. --Andy > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Aug 20, 2013 at 05:47:10PM -0700, Andy Lutomirski wrote: > On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote: > >> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote: > >> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote: > >> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote: > >> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time > >> >> >> >> update, but that would complicate non-transactional filesystems. > >> >> >> >> Those filesystems should just flush cmtime at the end of writepages. > >> >> >> > > >> >> >> > do_writepages() is the wrong place to do such updates - we can get > >> >> >> > writeback directly through .writepage, so the time updates need to > >> >> >> > be in .writepage. That first .writepage call will clear the bit on > >> >> >> > the mapping, so it's only done on the first call to .writepage on > >> >> >> > the given mapping. > >> >> >> > >> >> >> Last time I checked, all the paths that actually needed the timestamp > >> >> >> update went through .writepages. I'll double-check. > >> >> > kswapd can call just .writepage to do the writeout so timestamp update > >> >> > should be handled there as well. Otherwise all pages in a mapping can be > >> >> > cleaned without timestamp being updated. > >> >> > >> >> OK, I'll fix that. > >> >> > >> >> > > >> >> > Which btw made me realize that even your scheme doesn't completely make > >> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you > >> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096) > >> >> > is called. We write the page 0, writeprotect it, update timestamps. But > >> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither > >> >> > update the timestamp... Not that I think this can be reasonably solved but > >> >> > it is a food for thought. > >> >> > >> >> This should already work. AS_CMTIME is set when the pte goes from > >> >> dirty to clean, not when the pte goes from wp to writable. So > >> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will > >> >> be set and a subsequent writepages call will update the timestamp. > >> > > >> > Oh, I missed that - I thought you were setting AS_CMTIME during > >> > .page_mkwrite. > >> > > >> > Setting it in clear_page_dirty_for_io() is too late for filesystems > >> > to include it in their existing transactions during .writepage, (at > >> > least for XFs and ext4) because they do their delayed allocation > >> > transactions before changing page state.... > >> > >> Couldn't it go between mpage_map_and_submit_extent and > >> ext4_journal_stop in ext4_writepages? > > > > Maybe - I'm not an ext4 expert - but even if you can make it work > > for ext4 in some way, that doesn't mean it is possible for any other > > filesystem to use the same method. You're adding code to generic, > > non-filesystem specific code paths and so the solutions need to be > > generic rather not tied to how a specific filesystem is implemented. > > > > I don't see the problem for xfs or btrfs either. > > xfs uses generic_writepages, which already does the right thing. (xfs > with my updated patches passes my tests.) xfs_vm_writepage calls > xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is > called. At that point (I presume), it would still be possible to add > metadata to a transaction (assuming there's a transaction open -- I > don't have a clue here). That's my point - there isn't a transaction in XFS at this point, and so if we gather that flag from clear_page_dirty_for_io() we'd need a second transaction and therefore the optimisation you want filesystems to use to mitigate the additional overhead can't be done for all commonly used filesystems. > Even if this is too late, xfs_vm_writepage > could call page_mkwrite to for AS_CMTIME to be set if needed. > page_mkwrite will be fast if the page isn't mmapped. What am I > missing? That it leads to different behaviour for different filesystems. i.e. page_mkwrite on page A sets the flag. writeback on a range that doesn't include page A occurs, sees the flag, clears it after updating the timestamp. Some time later writeback on page A occurs, no timestamp update occurs. The behaviour needs to be consistent across filesystems. Cheers, Dave.
diff --git a/include/linux/fs.h b/include/linux/fs.h index 86cf0a4..f224155 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -350,6 +350,15 @@ struct address_space_operations { /* Write back some dirty pages from this mapping. */ int (*writepages)(struct address_space *, struct writeback_control *); + /* + * Userspace expects certain system calls to update cmtime if + * a file has been recently written using a shared vma. In + * cases where cmtime may need to be updated but writepages is + * not called, this is called instead. (Implementations + * should call mapping_test_clear_cmtime.) + */ + void (*flush_cmtime)(struct address_space *); + /* Set a page dirty. Return true if this dirtied it */ int (*set_page_dirty)(struct page *page); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 4e198ca..f6e8261 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -174,6 +174,7 @@ typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); +void generic_flush_cmtime(struct address_space *mapping); void tag_pages_for_writeback(struct address_space *mapping, pgoff_t start, pgoff_t end); int write_cache_pages(struct address_space *mapping, diff --git a/mm/mmap.c b/mm/mmap.c index 1edbaa3..7ed7700 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1,3 +1,4 @@ + /* * mm/mmap.c * @@ -249,8 +250,14 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma) might_sleep(); if (vma->vm_ops && vma->vm_ops->close) vma->vm_ops->close(vma); - if (vma->vm_file) + if (vma->vm_file) { + if ((vma->vm_flags & VM_SHARED) && vma->vm_file->f_mapping) { + struct address_space *mapping = vma->vm_file->f_mapping; + if (mapping->a_ops && mapping->a_ops->flush_cmtime) + mapping->a_ops->flush_cmtime(mapping); + } fput(vma->vm_file); + } mpol_put(vma_policy(vma)); kmem_cache_free(vm_area_cachep, vma); return next; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3f0c895..9ab8c9e 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) ret = mapping->a_ops->writepages(mapping, wbc); else ret = generic_writepages(mapping, wbc); + + /* + * ->writepages will call clear_page_dirty_for_io, which may, in turn, + * mark the mapping for deferred cmtime update. As an optimization, + * a filesystem can flush the update at the end of ->writepages + * (possibly avoiding a journal transaction, for example), but, + * for simplicity, let filesystems skip that part and just implement + * ->flush_cmtime. + */ + if (mapping->a_ops->flush_cmtime) + mapping->a_ops->flush_cmtime(mapping); + return ret; } @@ -1970,6 +1982,20 @@ int write_one_page(struct page *page, int wait) } EXPORT_SYMBOL(write_one_page); +/** + * generic_flush_cmtime - perform a deferred cmtime update if needed + * @mapping: address space structure + * + * This is a library function, which implements the flush_cmtime() + * address_space_operation. + */ +void generic_flush_cmtime(struct address_space *mapping) +{ + if (mapping_test_clear_cmtime(mapping)) + inode_update_time_writable(mapping->host); +} +EXPORT_SYMBOL(generic_flush_cmtime); + /* * For address_spaces which do not use buffers nor write back. */
Filesystems that defer cmtime updates should update cmtime when any of these events happen after a write via a mapping: - The mapping is written back to disk. This happens from all kinds of places, all of which eventually call ->writepages. - munmap is called or the mapping is removed when the process exits - msync(MS_ASYNC) is called. Linux currently does nothing for msync(MS_ASYNC), but POSIX says that cmtime should be updated some time between an mmaped write and the subsequent msync call. MS_SYNC calls ->writepages, but MS_ASYNC needs special handling. Filesystmes that defer cmtime updates should flush them on munmap or exit. Finding out that this happened through vm_ops is messy, so add a new address space op for this. It's not strictly necessary to call ->flush_cmtime after ->writepages, but it simplifies the fs code. As an optional optimization, filesystems can call mapping_test_clear_cmtime themselves in ->writepages (as long as they're careful to scan all the pages first -- the cmtime bit may not be set when ->writepages is entered). This patch does not implement the MS_ASYNC case; that's in the next patch. Signed-off-by: Andy Lutomirski <luto@amacapital.net> --- include/linux/fs.h | 9 +++++++++ include/linux/writeback.h | 1 + mm/mmap.c | 9 ++++++++- mm/page-writeback.c | 26 ++++++++++++++++++++++++++ 4 files changed, 44 insertions(+), 1 deletion(-)