[v5,04/13] mm: Add readahead address space operation

Message ID	20200211010348.6872-5-willy@infradead.org
State	Not Applicable
Headers	show Return-Path: <linux-ext4-owner@vger.kernel.org> From: Matthew Wilcox <willy@infradead.org> To: linux-fsdevel@vger.kernel.org Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, linux-ext4@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com, ocfs2-devel@oss.oracle.com, linux-xfs@vger.kernel.org Subject: [PATCH v5 04/13] mm: Add readahead address space operation Date: Mon, 10 Feb 2020 17:03:39 -0800 Message-Id: <20200211010348.6872-5-willy@infradead.org> In-Reply-To: <20200211010348.6872-1-willy@infradead.org> References: <20200211010348.6872-1-willy@infradead.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk
Series	Change readahead API \| expand [v5,00/13] Change readahead API [v5,01/13] mm: Fix the return type of __do_page_cache_readahead [v5,02/13] mm: Ignore return value of ->readpages [v5,03/13] mm: Put readahead pages in cache earlier [v5,04/13] mm: Add readahead address space operation [v5,05/13] mm: Add page_cache_readahead_limit [v5,06/13] fs: Convert mpage_readpages to mpage_readahead [v5,07/13] btrfs: Convert from readpages to readahead [v5,08/13] erofs: Convert uncompressed files from readpages to readahead [v5,09/13] erofs: Convert compressed files from readpages to readahead [v5,10/13] ext4: Convert from readpages to readahead [v5,11/13] f2fs: Convert from readpages to readahead [v5,12/13] fuse: Convert from readpages to readahead [v5,13/13] iomap: Convert from readpages to readahead

Matthew Wilcox Feb. 11, 2020, 1:03 a.m. UTC

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This replaces ->readpages with a saner interface:
 - Return void instead of an ignored error code.
 - Pages are already in the page cache when ->readahead is called.
 - Implementation looks up the pages in the page cache instead of
   having them passed in a linked list.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst |  6 ++-
 Documentation/filesystems/vfs.rst     | 13 +++++++
 include/linux/fs.h                    |  2 +
 include/linux/pagemap.h               | 54 +++++++++++++++++++++++++++
 mm/readahead.c                        | 48 ++++++++++++++----------
 5 files changed, 102 insertions(+), 21 deletions(-)

Dave Chinner Feb. 11, 2020, 4:52 a.m. UTC | #1

On Mon, Feb 10, 2020 at 05:03:39PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Pages are already in the page cache when ->readahead is called.
>  - Implementation looks up the pages in the page cache instead of
>    having them passed in a linked list.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

....

>  
> +/*
> + * Readahead is of a block of consecutive pages.
> + */
> +struct readahead_control {
> +	struct file *file;
> +	struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */
> +	pgoff_t start;
> +	unsigned int nr_pages;
> +	unsigned int batch_count;
> +};
> +
> +static inline struct page *readahead_page(struct readahead_control *rac)
> +{
> +	struct page *page;
> +
> +	if (!rac->nr_pages)
> +		return NULL;
> +
> +	page = xa_load(&rac->mapping->i_pages, rac->start);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	rac->batch_count = hpage_nr_pages(page);
> +	rac->start += rac->batch_count;

There's no mention of large page support in the patch description
and I don't recall this sort of large page batching in previous
iterations.

This seems like new functionality to me, not directly related to
the initial ->readahead API change? What have I missed?

Cheers,

Dave.

Matthew Wilcox Feb. 11, 2020, 12:54 p.m. UTC | #2

On Tue, Feb 11, 2020 at 03:52:30PM +1100, Dave Chinner wrote:
> > +struct readahead_control {
> > +	struct file *file;
> > +	struct address_space *mapping;
> > +/* private: use the readahead_* accessors instead */
> > +	pgoff_t start;
> > +	unsigned int nr_pages;
> > +	unsigned int batch_count;
> > +};
> > +
> > +static inline struct page *readahead_page(struct readahead_control *rac)
> > +{
> > +	struct page *page;
> > +
> > +	if (!rac->nr_pages)
> > +		return NULL;
> > +
> > +	page = xa_load(&rac->mapping->i_pages, rac->start);
> > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +	rac->batch_count = hpage_nr_pages(page);
> > +	rac->start += rac->batch_count;
> 
> There's no mention of large page support in the patch description
> and I don't recall this sort of large page batching in previous
> iterations.
> 
> This seems like new functionality to me, not directly related to
> the initial ->readahead API change? What have I missed?

I had a crisis of confidence when I was working on this -- the loop
originally looked like this:

#define readahead_for_each(rac, page)                                   \
        for (; (page = readahead_page(rac)); rac->nr_pages--)

and then I started thinking about what I'd need to do to support large
pages, and that turned into

#define readahead_for_each(rac, page)                                   \
        for (; (page = readahead_page(rac));				\
		rac->nr_pages -= hpage_nr_pages(page))

but I realised that was potentially a use-after-free because 'page' has
certainly had put_page() called on it by then.  I had a brief period
where I looked at moving put_page() away from being the filesystem's
responsibility and into the iterator, but that would introduce more
changes into the patchset, as well as causing problems for filesystems
that want to break out of the loop.

By this point, I was also looking at the readahead_for_each_batch()
iterator that btrfs uses, and so we have the batch count anyway, and we
might as well use it to store the number of subpages of the large page.
And so it became easier to just put the whole ball of wax into the initial
patch set, rather than introduce the iterator now and then fix it up in
the patch set that I'm basing on this.

So yes, there's a certain amount of excess functionality in this patch
set ... I can remove it for the next release.

Dave Chinner Feb. 11, 2020, 8:08 p.m. UTC | #3

On Tue, Feb 11, 2020 at 04:54:13AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 11, 2020 at 03:52:30PM +1100, Dave Chinner wrote:
> > > +struct readahead_control {
> > > +	struct file *file;
> > > +	struct address_space *mapping;
> > > +/* private: use the readahead_* accessors instead */
> > > +	pgoff_t start;
> > > +	unsigned int nr_pages;
> > > +	unsigned int batch_count;
> > > +};
> > > +
> > > +static inline struct page *readahead_page(struct readahead_control *rac)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	if (!rac->nr_pages)
> > > +		return NULL;
> > > +
> > > +	page = xa_load(&rac->mapping->i_pages, rac->start);
> > > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +	rac->batch_count = hpage_nr_pages(page);
> > > +	rac->start += rac->batch_count;
> > 
> > There's no mention of large page support in the patch description
> > and I don't recall this sort of large page batching in previous
> > iterations.
> > 
> > This seems like new functionality to me, not directly related to
> > the initial ->readahead API change? What have I missed?
> 
> I had a crisis of confidence when I was working on this -- the loop
> originally looked like this:
> 
> #define readahead_for_each(rac, page)                                   \
>         for (; (page = readahead_page(rac)); rac->nr_pages--)
> 
> and then I started thinking about what I'd need to do to support large
> pages, and that turned into
> 
> #define readahead_for_each(rac, page)                                   \
>         for (; (page = readahead_page(rac));				\
> 		rac->nr_pages -= hpage_nr_pages(page))
> 
> but I realised that was potentially a use-after-free because 'page' has
> certainly had put_page() called on it by then.  I had a brief period
> where I looked at moving put_page() away from being the filesystem's
> responsibility and into the iterator, but that would introduce more
> changes into the patchset, as well as causing problems for filesystems
> that want to break out of the loop.
> 
> By this point, I was also looking at the readahead_for_each_batch()
> iterator that btrfs uses, and so we have the batch count anyway, and we
> might as well use it to store the number of subpages of the large page.
> And so it became easier to just put the whole ball of wax into the initial
> patch set, rather than introduce the iterator now and then fix it up in
> the patch set that I'm basing on this.
> 
> So yes, there's a certain amount of excess functionality in this patch
> set ... I can remove it for the next release.

I'd say "Just document it" as that was the main reason I noticed it.
Or perhaps add the batching function as a stand-alone patch so it's
clear that the batch interface solves two problems at once - large
pages and the btrfs page batching implementation...

Cheers,

Dave.

Christoph Hellwig Feb. 12, 2020, 6:18 p.m. UTC | #4

On Mon, Feb 10, 2020 at 05:03:39PM -0800, Matthew Wilcox wrote:
> +struct readahead_control {
> +	struct file *file;
> +	struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */
> +	pgoff_t start;
> +	unsigned int nr_pages;
> +	unsigned int batch_count;

We often use __ prefixes for the private fields to make that a little
more clear.

John Hubbard Feb. 14, 2020, 5:36 a.m. UTC | #5

On 2/10/20 5:03 PM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Pages are already in the page cache when ->readahead is called.
>  - Implementation looks up the pages in the page cache instead of
>    having them passed in a linked list.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  Documentation/filesystems/locking.rst |  6 ++-
>  Documentation/filesystems/vfs.rst     | 13 +++++++
>  include/linux/fs.h                    |  2 +
>  include/linux/pagemap.h               | 54 +++++++++++++++++++++++++++
>  mm/readahead.c                        | 48 ++++++++++++++----------
>  5 files changed, 102 insertions(+), 21 deletions(-)
> 

A minor question below, but either way you can add:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>



> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..0ebc4491025a 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,7 @@ prototypes::
>  	int (*readpage)(struct file *, struct page *);
>  	int (*writepages)(struct address_space *, struct writeback_control *);
>  	int (*set_page_dirty)(struct page *page);
> +	void (*readahead)(struct readahead_control *);
>  	int (*readpages)(struct file *filp, struct address_space *mapping,
>  			struct list_head *pages, unsigned nr_pages);
>  	int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +272,8 @@ writepage:		yes, unlocks (see below)
>  readpage:		yes, unlocks
>  writepages:
>  set_page_dirty		no
> -readpages:
> +readahead:		yes, unlocks
> +readpages:		no
>  write_begin:		locks the page		 exclusive
>  write_end:		yes, unlocks		 exclusive
>  bmap:
> @@ -295,6 +297,8 @@ the request handler (/dev/loop).
>  ->readpage() unlocks the page, either synchronously or via I/O
>  completion.
>  
> +->readahead() unlocks the pages like ->readpage().
> +
>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..cabee16b7406 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
>  		int (*readpage)(struct file *, struct page *);
>  		int (*writepages)(struct address_space *, struct writeback_control *);
>  		int (*set_page_dirty)(struct page *page);
> +		void (*readahead)(struct readahead_control *);
>  		int (*readpages)(struct file *filp, struct address_space *mapping,
>  				 struct list_head *pages, unsigned nr_pages);
>  		int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
>  	If defined, it should set the PageDirty flag, and the
>  	PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> +	Called by the VM to read pages associated with the address_space
> +	object.  The pages are consecutive in the page cache and are
> +	locked.  The implementation should decrement the page refcount
> +	after starting I/O on each page.  Usually the page will be
> +	unlocked by the I/O completion handler.  If the function does
> +	not attempt I/O on some pages, the caller will decrement the page
> +	refcount and unlock the pages for you.	Set PageUptodate if the
> +	I/O completes successfully.  Setting PageError on any page will
> +	be ignored; simply unlock the page if an I/O error occurs.
> +
>  ``readpages``
>  	called by the VM to read pages associated with the address_space
>  	object.  This is essentially just a vector version of readpage.
>  	Instead of just one page, several pages are requested.
>  	readpages is only used for read-ahead, so read errors are
>  	ignored.  If anything goes wrong, feel free to give up.
> +        This interface is deprecated; implement readahead instead.
>  
>  ``write_begin``
>  	Called by the generic buffered write code to ask the filesystem
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..d4e2d2964346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -292,6 +292,7 @@ enum positive_aop_returns {
>  struct page;
>  struct address_space;
>  struct writeback_control;
> +struct readahead_control;
>  
>  /*
>   * Write life time hint values.
> @@ -375,6 +376,7 @@ struct address_space_operations {
>  	 */
>  	int (*readpages)(struct file *filp, struct address_space *mapping,
>  			struct list_head *pages, unsigned nr_pages);
> +	void (*readahead)(struct readahead_control *);
>  
>  	int (*write_begin)(struct file *, struct address_space *mapping,
>  				loff_t pos, unsigned len, unsigned flags,
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..13efafaf7e1f 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -630,6 +630,60 @@ static inline int add_to_page_cache(struct page *page,
>  	return error;
>  }
>  
> +/*
> + * Readahead is of a block of consecutive pages.
> + */
> +struct readahead_control {
> +	struct file *file;
> +	struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */
> +	pgoff_t start;
> +	unsigned int nr_pages;
> +	unsigned int batch_count;
> +};
> +
> +static inline struct page *readahead_page(struct readahead_control *rac)
> +{
> +	struct page *page;
> +
> +	if (!rac->nr_pages)
> +		return NULL;
> +
> +	page = xa_load(&rac->mapping->i_pages, rac->start);


Is it worth asserting that the page was found:

	VM_BUG_ON_PAGE(!page || xa_is_value(page), page);

? Or is that overkill here?


> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	rac->batch_count = hpage_nr_pages(page);
> +	rac->start += rac->batch_count;


The above was surprising, until I saw the other thread with Dave and you.
I was reviewing this patchset in order to have a chance at understanding the 
follow-on patchset ("Large pages in the page cache"), and it seems like that
feature has a solid head start here. :)  


thanks,

Matthew Wilcox Feb. 15, 2020, 1:15 a.m. UTC | #6

On Thu, Feb 13, 2020 at 09:36:25PM -0800, John Hubbard wrote:
> > +static inline struct page *readahead_page(struct readahead_control *rac)
> > +{
> > +	struct page *page;
> > +
> > +	if (!rac->nr_pages)
> > +		return NULL;
> > +
> > +	page = xa_load(&rac->mapping->i_pages, rac->start);
> 
> 
> Is it worth asserting that the page was found:
> 
> 	VM_BUG_ON_PAGE(!page || xa_is_value(page), page);
> 
> ? Or is that overkill here?

It shouldn't be possible since they were just added in a locked state.
If it did happen, it'll be caught by the assert below -- dereferencing
a NULL pointer or a shadow entry is not going to go well.

> > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +	rac->batch_count = hpage_nr_pages(page);
> > +	rac->start += rac->batch_count;
> 
> The above was surprising, until I saw the other thread with Dave and you.
> I was reviewing this patchset in order to have a chance at understanding the 
> follow-on patchset ("Large pages in the page cache"), and it seems like that
> feature has a solid head start here. :)  

Right, I'll document that.

[v5,04/13] mm: Add readahead address space operation

Commit Message

Comments

Patch