Message ID | 1306767829-27580-1-git-send-email-jack@suse.cz |
---|---|
State | New, archived |
Headers | show |
On Mon, May 30, 2011 at 05:03:49PM +0200, Jan Kara wrote: > ext4_page_mkwrite() does not return page locked. This makes it hard > to avoid races with filesystem freezing code (so that we don't leave > writeable page on a frozen fs) or writeback code (so that we allow page > to be stable during writeback). > > Also the current code uses i_alloc_sem to avoid races with truncate but that > seems to be the wrong locking order according to lock ordering documented in > mm/rmap.c. > > Also add a check for frozen filesystem so that we don't busyloop in page fault > when the filesystem is frozen. > > Signed-off-by: Jan Kara <jack@suse.cz> I'm not sure this commit description is accurate --- or I'm horribly confused. The old code was in fact returning the page locked --- I assume you're referring to lock_page(page) (which is called right before the normal path return). So what am I missing? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ping? - Ted On Sun, Jun 05, 2011 at 10:17:54PM -0400, Ted Ts'o wrote: > On Mon, May 30, 2011 at 05:03:49PM +0200, Jan Kara wrote: > > ext4_page_mkwrite() does not return page locked. This makes it hard > > to avoid races with filesystem freezing code (so that we don't leave > > writeable page on a frozen fs) or writeback code (so that we allow page > > to be stable during writeback). > > > > Also the current code uses i_alloc_sem to avoid races with truncate but that > > seems to be the wrong locking order according to lock ordering documented in > > mm/rmap.c. > > > > Also add a check for frozen filesystem so that we don't busyloop in page fault > > when the filesystem is frozen. > > > > Signed-off-by: Jan Kara <jack@suse.cz> > > I'm not sure this commit description is accurate --- or I'm horribly > confused. > > The old code was in fact returning the page locked --- I assume you're > referring to lock_page(page) (which is called right before the normal > path return). So what am I missing? > > - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun 05-06-11 22:17:54, Ted Tso wrote: > On Mon, May 30, 2011 at 05:03:49PM +0200, Jan Kara wrote: > > ext4_page_mkwrite() does not return page locked. This makes it hard > > to avoid races with filesystem freezing code (so that we don't leave > > writeable page on a frozen fs) or writeback code (so that we allow page > > to be stable during writeback). > > > > Also the current code uses i_alloc_sem to avoid races with truncate but that > > seems to be the wrong locking order according to lock ordering documented in > > mm/rmap.c. > > > > Also add a check for frozen filesystem so that we don't busyloop in page fault > > when the filesystem is frozen. > > > > Signed-off-by: Jan Kara <jack@suse.cz> > > I'm not sure this commit description is accurate --- or I'm horribly > confused. > > The old code was in fact returning the page locked --- I assume you're > referring to lock_page(page) (which is called right before the normal > path return). So what am I missing? OK, so I read the code again and you are right that end_page_writeback() will be called before the work is queued and thus there are no problems with i_mutex (both with multiblock-IO code and with the old code). Sorry for confusion. But reading the code I have a few questions: In multiblock io submission code we call end_page_writeback() and drop page references before we queue work converting extent from uninitialized to initialized. But what prevents mm from reclaiming the page (and possibly loading zeros from disk) before we finish the conversion? I suppose that could be another cause of the problem described in 1449032be17abb69116dbc393f67ceb8bd034f92 but I don't see how that would be addressed. Related question is: Commit bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of inode reference when creating/freeing end IO work. But now I don't see what prevents mm from reaping the inode before the workqueue gets to processing it. Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the old IO submission code but apparently it forgot to return the old handling of uninitialized buffers so we unconditionnaly call block_write_full_page() without specifying end_io function. So AFAICS we never convert unwritten extents to written in some cases. For example when I mount the fs as: mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600); char buf[1024]; memset(buf, 'a', sizeof(buf)); fallocate(fd, 0, 0, 16384); write(fd, buf, sizeof(buf)); I get a file full of zeros (after remounting the filesystem so that pagecache is dropped) instead of seeing the first KB contain 'a's. So to return to our original discussion regarding i_mutex - you are right currently i_mutex does not cause a deadlock but to fix the first of the above problems we need to somehow pin the page before all metadata is properly updated, that metadata update requires i_mutex, and we will have to be careful to pin the page in such a way so that memory reclaim cannot get stuck reaping that page... If you can solve that, I'll be happy but currently I'd find it more robust to just call end_page_writeback() after we convert extents and avoid fs recursion from allocations. Honza
On Wed, Jun 08, 2011 at 04:55:49PM +0200, Jan Kara wrote: > OK, so I read the code again and you are right that end_page_writeback() > will be called before the work is queued and thus there are no problems > with i_mutex (both with multiblock-IO code and with the old code). Sorry for > confusion. But reading the code I have a few questions: > In multiblock io submission code we call end_page_writeback() and drop > page references before we queue work converting extent from uninitialized > to initialized. But what prevents mm from reclaiming the page (and possibly > loading zeros from disk) before we finish the conversion? We hold a reference on the page (i.e., via get_page()), which we don't drop until we are doing doing the conversion. The workqueue function (ext4_end_io_work, in fs/ext4/page_io.c) calls ext4_free_io_end(), which drops the page reference. > Related question is: Commit > bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of > inode reference when creating/freeing end IO work. But now I don't see what > prevents mm from reaping the inode before the workqueue gets to processing > it. Actually it was commit f7ad6d2e9201a which removed the getting and putting of the inode reference, and it describes the issues involved. Specifically: The following BUG can occur when an inode which is getting freed when it still has dirty pages outstanding, and it gets deleted (in this because it was the target of a rename). In ordered mode, we need to make sure the data pages are written just in case we crash before the rename (or unlink) is committed. If the inode is being freed then when we try to igrab the inode, we end up tripping the BUG_ON at fs/ext4/page-io.c:146. To solve this problem, we need to keep track of the number of io callbacks which are pending, and avoid destroying the inode until they have all been completed. That way we don't have to bump the inode count to keep the inode from being destroyed; an approach which doesn't work because the count could have already been dropped down to zero before the inode writeback has started (at which point we're not allowed to bump the count back up to 1, since it's already started getting freed). Thanks to Dave Chinner for suggesting this approach, which is also used by XFS. > Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the > old IO submission code but apparently it forgot to return the old handling > of uninitialized buffers so we unconditionnaly call block_write_full_page() > without specifying end_io function. Yeah. That was a bug, no question. No one apparently ran into it (or at least no one bothered to report it to linux-ext4 or linux-kernel as far as I know) while it was the default (i.e., during 2.6.37), and as of 2.6.38 we are now using the new IO submission code by default again. Given that no one has reported any problems with the new IO submission code, I was planning on completely removing the nomblk_io_submit mount option for the 3.1 kernel, which would fix the bug by virtue of nuking code in question. :-) > So to return to our original discussion regarding i_mutex - you are right > currently i_mutex does not cause a deadlock but to fix the first of the > above problems we need to somehow pin the page before all metadata is > properly updated, that metadata update requires i_mutex, and we will have > to be careful to pin the page in such a way so that memory reclaim cannot > get stuck reaping that page... If you can solve that, I'll be happy but > currently I'd find it more robust to just call end_page_writeback() after > we convert extents and avoid fs recursion from allocations. So I think the current fs/ext4/page_io.c code solves the problem simply by grabbing a reference on the bug, which should prevent the mm layer from reclaiming the page. Does that satisfy you? Do you see anything that I may have missed? (I very well could have missed something; after all, this code is subtle, which is why I didn't feel comfortable responsible until I could find the time to go and research this in depth; sorry for delay in getting back to you.) - Ted P.S. Upon further reflection, perhaps we need to drop a few more code comments in it so it's clearer to folks who try to go through this code path in the future. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun 19-06-11 22:53:14, Ted Tso wrote: > On Wed, Jun 08, 2011 at 04:55:49PM +0200, Jan Kara wrote: > > OK, so I read the code again and you are right that end_page_writeback() > > will be called before the work is queued and thus there are no problems > > with i_mutex (both with multiblock-IO code and with the old code). Sorry for > > confusion. But reading the code I have a few questions: > > In multiblock io submission code we call end_page_writeback() and drop > > page references before we queue work converting extent from uninitialized > > to initialized. But what prevents mm from reclaiming the page (and possibly > > loading zeros from disk) before we finish the conversion? > > We hold a reference on the page (i.e., via get_page()), which we don't > drop until we are doing doing the conversion. The workqueue function > (ext4_end_io_work, in fs/ext4/page_io.c) calls ext4_free_io_end(), > which drops the page reference. Hmm, let's go though the call chain and see where I miss something: io_end is allocated and bio is initialized in io_submit_init() - we set bi_end_io to ext4_end_bio there pages are added to io_end ext4_bio_write_page() - it creates io_page (p_count is set to 1, we grab page reference - get_page()), then io_submit_add_bh() adds io_page to io_end (p_count increased to 2), then we do put_io_page() in the end of ext4_bio_write_page() -> p_count == 1. Eventually bio is submitted, io gets completed and ext4_end_bio() is called. It goes though all pages and does put_io_page(), thus p_count gets to 0, we clear PageWriteback and do put_page(). From this moment on, we hold no reference to the page AFAICS. Note, there is a separate call chain from ext4_writepage(). There the reference handling seems to be as you describe - i.e., we get reference in ext4_set_bh_endio(), we call end_buffer_async_write() in ext4_end_io_buffer_write() which clears PageWriteback and then we drop page reference in ext4_free_io_end(). > > Related question is: Commit > > bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of > > inode reference when creating/freeing end IO work. But now I don't see what > > prevents mm from reaping the inode before the workqueue gets to processing > > it. > > Actually it was commit f7ad6d2e9201a which removed the getting and > putting of the inode reference, and it describes the issues involved. Ah, OK. Thanks for the pointer. > > Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the > > old IO submission code but apparently it forgot to return the old handling > > of uninitialized buffers so we unconditionnaly call block_write_full_page() > > without specifying end_io function. > > Yeah. That was a bug, no question. No one apparently ran into it (or > at least no one bothered to report it to linux-ext4 or linux-kernel as > far as I know) while it was the default (i.e., during 2.6.37), and as > of 2.6.38 we are now using the new IO submission code by default > again. Given that no one has reported any problems with the new IO > submission code, I was planning on completely removing the > nomblk_io_submit mount option for the 3.1 kernel, which would fix the > bug by virtue of nuking code in question. :-) Yep, that would solve it as well :). > > So to return to our original discussion regarding i_mutex - you are right > > currently i_mutex does not cause a deadlock but to fix the first of the > > above problems we need to somehow pin the page before all metadata is > > properly updated, that metadata update requires i_mutex, and we will have > > to be careful to pin the page in such a way so that memory reclaim cannot > > get stuck reaping that page... If you can solve that, I'll be happy but > > currently I'd find it more robust to just call end_page_writeback() after > > we convert extents and avoid fs recursion from allocations. > > So I think the current fs/ext4/page_io.c code solves the problem > simply by grabbing a reference on the bug, which should prevent the mm > layer from reclaiming the page. Does that satisfy you? Do you see > anything that I may have missed? OK, if we fix the reference counting I'm now convinced that the code would be correct. Fixing that does not seem to be completely simple - we need to use p_count to detect when we can do end_page_writeback() when blocksize < pagesize and io submission races with io completion (current code is correct in that respect). But we need another mechanism detect when we can safely put page reference... Separate reference counter would certainly do but it might be an overkill. Possibly we could get additional page reference for each io_page reference? > P.S. Upon further reflection, perhaps we need to drop a few more code > comments in it so it's clearer to folks who try to go through this > code path in the future. Well, I think it would be worthwhile to have a highlevel description somewhere (i.e. what protects inode from freeing, what protects page from freeing, how we manage PageWriteback, page references and extent conversion). The actual code is not that hard to follow but it's not easy to see the design behind it... Honza
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a5763e3..d679dd8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5843,80 +5843,84 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) struct page *page = vmf->page; loff_t size; unsigned long len; - int ret = -EINVAL; - void *fsdata; + int ret; struct file *file = vma->vm_file; struct inode *inode = file->f_path.dentry->d_inode; struct address_space *mapping = inode->i_mapping; + handle_t *handle; + get_block_t *get_block; + int retries = 0; /* - * Get i_alloc_sem to stop truncates messing with the inode. We cannot - * get i_mutex because we are already holding mmap_sem. + * This check is racy but catches the common case. We rely on + * __block_page_mkwrite() to do a reliable check. */ - down_read(&inode->i_alloc_sem); - size = i_size_read(inode); - if (page->mapping != mapping || size <= page_offset(page) - || !PageUptodate(page)) { - /* page got truncated from under us? */ - goto out_unlock; + vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE); + /* Delalloc case is easy... */ + if (test_opt(inode->i_sb, DELALLOC) && + !ext4_should_journal_data(inode) && + !ext4_nonda_switch(inode->i_sb)) { + do { + ret = __block_page_mkwrite(vma, vmf, + ext4_da_get_block_prep); + } while (ret == -ENOSPC && + ext4_should_retry_alloc(inode->i_sb, &retries)); + goto out_ret; } - ret = 0; lock_page(page); - wait_on_page_writeback(page); - if (PageMappedToDisk(page)) { - up_read(&inode->i_alloc_sem); - return VM_FAULT_LOCKED; + size = i_size_read(inode); + /* Page got truncated from under us? */ + if (page->mapping != mapping || page_offset(page) > size) { + unlock_page(page); + ret = VM_FAULT_NOPAGE; + goto out; } if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; - /* - * return if we have all the buffers mapped. This avoid - * the need to call write_begin/write_end which does a - * journal_start/journal_stop which can block and take - * long time + * Return if we have all the buffers mapped. This avoids the need to do + * journal_start/journal_stop which can block and take a long time */ if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - up_read(&inode->i_alloc_sem); - return VM_FAULT_LOCKED; + /* Wait so that we don't change page under IO */ + wait_on_page_writeback(page); + ret = VM_FAULT_LOCKED; + goto out; } } unlock_page(page); - /* - * OK, we need to fill the hole... Do write_begin write_end - * to do block allocation/reservation.We are not holding - * inode.i__mutex here. That allow * parallel write_begin, - * write_end call. lock_page prevent this from happening - * on the same page though - */ - ret = mapping->a_ops->write_begin(file, mapping, page_offset(page), - len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata); - if (ret < 0) - goto out_unlock; - ret = mapping->a_ops->write_end(file, mapping, page_offset(page), - len, len, page, fsdata); - if (ret < 0) - goto out_unlock; - ret = 0; - - /* - * write_begin/end might have created a dirty page and someone - * could wander in and start the IO. Make sure that hasn't - * happened. - */ - lock_page(page); - wait_on_page_writeback(page); - up_read(&inode->i_alloc_sem); - return VM_FAULT_LOCKED; -out_unlock: - if (ret) + /* OK, we need to fill the hole... */ + if (ext4_should_dioread_nolock(inode)) + get_block = ext4_get_block_write; + else + get_block = ext4_get_block; +retry_alloc: + handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode)); + if (IS_ERR(handle)) { ret = VM_FAULT_SIGBUS; - up_read(&inode->i_alloc_sem); + goto out; + } + ret = __block_page_mkwrite(vma, vmf, get_block); + if (!ret && ext4_should_journal_data(inode)) { + if (walk_page_buffers(handle, page_buffers(page), 0, + PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) { + unlock_page(page); + ret = VM_FAULT_SIGBUS; + goto out; + } + ext4_set_inode_state(inode, EXT4_STATE_JDATA); + } + ext4_journal_stop(handle); + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry_alloc; +out_ret: + ret = block_page_mkwrite_return(ret); +out: return ret; }
ext4_page_mkwrite() does not return page locked. This makes it hard to avoid races with filesystem freezing code (so that we don't leave writeable page on a frozen fs) or writeback code (so that we allow page to be stable during writeback). Also the current code uses i_alloc_sem to avoid races with truncate but that seems to be the wrong locking order according to lock ordering documented in mm/rmap.c. Also add a check for frozen filesystem so that we don't busyloop in page fault when the filesystem is frozen. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 106 ++++++++++++++++++++++++++++-------------------------- 1 files changed, 55 insertions(+), 51 deletions(-) Ted, necessary generic changes went upstream so I've rebased this patch on top of 3.0-rc1 (which also resolves the conflict with Darrick's patches).