ext4: Fix deadlock on page reclaim

Message ID	20190725093358.30679-1-damien.lemoal@wdc.com
State	New
Headers	show Return-Path: <linux-ext4-owner@vger.kernel.org> IronPort-SDR: PMG0JJBjX/CfWnbE0QqzMgCTkv6PFW2++Y5gvRHYQ7arRsm+wzDiD6x9PagPsa02BnIfnkQPIy xvKJWbHPWSrFSamrrhKWfI2+hds+My9y9+O1S+3mS6ZOvkfg4X9TJJfwr3bAVvzi7T9dasKlK+ F84aKk9+ZdnzcCAUmFEgwP7HESlf1RlaguLizDBX2uALnCe4uLnvOVxtGIo/z5wxjakaLac2q1 gDdkKDrAY9zxZwWm6AZJSuSyZjoPa6DSJJZfFmQBSP5Qkmy2goXeQZk7qvxJZWli0SuEMzN3Wy gz4= IronPort-SDR: vlqaeLX+7jKxqVgCDzdeJX8wPBYOLwoAr2f6LfQ4alFHWias5mfJ+6P3Om3fQhxWb1LZv4/dN9 vJHgw5LrbfNJpaTpOL3VtMzp5cuSVUkQN/vU1myiP4oso3Kj4OqvxFGUFTQuDndR6d1RZh5oRw NbM4VpMqSaIVyE36xx6nE7OVwR2XZSUj2TFt60YS+F8ejVTerpOWjOMw2LA0qQUuo7kw8AdnLS BzshEW6gk9sbkRskYOP8nM+LPU4L5SQ11b9y/NW5hXTSsbTS5apkuy2Md1DCS3p5lQ3vKos65c ja2WKrEpcb1o8MrLTdeiAK7z IronPort-SDR: BQxM+7SVLejzr93XUQnPHDQSzbG+QoRfGmF1y+cm/KleY1GKgyyFrSB+rF2y30+rWl0T1cwvMp Ym5toaQOA0gk9hLQF6d07+ugBsqTMxEP0MoF4BwotEltBsGC9CX2icOVxmaPC1ZHXwL/sSFZQa 7dk4sXgNkfwAia1VA6mcnugJq/HfCADnd5gSK0oHCjooguJT4JX7N/EH4Hfb9ohC9Z/Ttfe05T /piWp/+msz7NaEj26WsJNDMHxNVt1FsOa4ynoC8tYODKdRNrxxblSrUIa+ATM7n+T/1ZUdHy+E V6o= From: Damien Le Moal <damien.lemoal@wdc.com> To: Theodore Ts'o <tytso@mit.edu>, Andreas Dilger <adilger.kernel@dilger.ca>, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Johannes Thumshirn <jthumshirn@suse.de>, Naohiro Aota <naohiro.aota@wdc.com>, Masato Suzuki <masato.suzuki@wdc.com> Subject: [PATCH] ext4: Fix deadlock on page reclaim Date: Thu, 25 Jul 2019 18:33:58 +0900 Message-Id: <20190725093358.30679-1-damien.lemoal@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk
Series	ext4: Fix deadlock on page reclaim \| expand ext4: Fix deadlock on page reclaim

Damien Le Moal July 25, 2019, 9:33 a.m. UTC

In ext4_[da_]write_begin(), grab_cache_page_write_begin() is being
called without GFP_NOFS set for the context. This is considered adequate
as any eventual memory reclaim triggered by a page allocation is being
done before the transaction handle for the write operation is started.

However, with the following setup:

* fo creating and writing files on XFS
* XFS file system on top of dm-zoned target device
* dm-zoned target created on tcmu-runner emulated ZBC disk
* emulated ZBC disk backend file on ext4
* ext4 file system on regular SATA SSD

A deadlock was observed under the heavy file write workload generated
using fio. The deadlock is clearly apparent from the backtrace of the
tcmu-runner handler task backtrace:

tcmu-runner call Trace:
wait_for_completion+0x12c/0x170		<-- deadlock (see below text)
xfs_buf_iowait+0x39/0x1c0 [xfs]
__xfs_buf_submit+0x118/0x2e0 [xfs]	<-- XFS issues writes to
				            dm-zoned device, which in
					    turn will issue writes to
					    the emulated ZBC device
					    handled by tcmu-runner.
xfs_bwrite+0x25/0x60 [xfs]
xfs_reclaim_inode+0x303/0x330 [xfs]
xfs_reclaim_inodes_ag+0x223/0x450 [xfs]
xfs_reclaim_inodes_nr+0x31/0x40 [xfs]	<-- Absence of GFP_NOFS allows
					    reclaim to call in XFS
					    reclaim for the XFS file
					    system on the emulated
					    device
super_cache_scan+0x153/0x1a0
do_shrink_slab+0x17b/0x3c0
shrink_slab+0x170/0x2c0
shrink_node+0x1d6/0x4a0
do_try_to_free_pages+0xdb/0x3c0
try_to_free_pages+0x112/0x2e0		<-- Page reclaim triggers
__alloc_pages_slowpath+0x422/0x1020
__alloc_pages_nodemask+0x37f/0x400
pagecache_get_page+0xb4/0x390
grab_cache_page_write_begin+0x1d/0x40
ext4_da_write_begin+0xd6/0x530
generic_perform_write+0xc2/0x1e0
__generic_file_write_iter+0xf9/0x1d0
ext4_file_write_iter+0xc6/0x3b0
new_sync_write+0x12d/0x1d0
vfs_write+0xdb/0x1d0
ksys_pwrite64+0x65/0xa0		<-- tcmu-runner ZBC handler writes to
				    the backend file in response to
				    dm-zoned target write IO

When XFS reclaim issues write IOs to the dm-zoned target device,
dm-zoned issue write IOs to the tcmu-runner emulated device. However,
the tcmu ZBC handler is singled threaded and blocked waiting for the
completion of the started pwrite() call and so does not process the
newly issued write IOs necessary to complete page reclaim. The system
is in a deadlocked state. This problem is 100% reproducible, the
deadlock happening after fio running for a few minutes.

A similar deadlock was also observed for read operations into the ext4
backend file on page cache miss trigering a page allocation and page
reclaim. Switching the tcmu emulated ZBC disk backend file from an ext4
file to an XFS file, none of these deadlocks are observed.

Fix this problem by removing __GFP_FS from ext4 inode mapping gfp_mask.
The code used for this fix is borrowed from XFS xfs_setup_inode(). The
inode mapping gfp_mask initialization is added to ext4_set_aops().

Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 fs/ext4/inode.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Christoph Hellwig July 25, 2019, 11:54 a.m. UTC | #1

On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
> +	gfp_t gfp_mask;
> +
>  	switch (ext4_inode_journal_mode(inode)) {
>  	case EXT4_INODE_ORDERED_DATA_MODE:
>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>  		inode->i_mapping->a_ops = &ext4_da_aops;
>  	else
>  		inode->i_mapping->a_ops = &ext4_aops;
> +
> +	/*
> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
> +	 * prevent direct reclaim recursion back into the filesystem and blowing
> +	 * stacks or deadlocking.
> +	 */
> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));

This looks like something that could hit every file systems, so
shouldn't we fix this in common code?  We could also look into
just using memalloc_nofs_save for the page cache allocation path
instead of the per-mapping gfp_mask.

>  }
>  
>  static int __ext4_block_zero_page_range(handle_t *handle,
> -- 
> 2.21.0
> 
---end quoted text---

Andreas Dilger July 25, 2019, 7:57 p.m. UTC | #2

On Jul 25, 2019, at 5:54 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
>> +	gfp_t gfp_mask;
>> +
>> 	switch (ext4_inode_journal_mode(inode)) {
>> 	case EXT4_INODE_ORDERED_DATA_MODE:
>> 	case EXT4_INODE_WRITEBACK_DATA_MODE:
>> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>> 		inode->i_mapping->a_ops = &ext4_da_aops;
>> 	else
>> 		inode->i_mapping->a_ops = &ext4_aops;
>> +
>> +	/*
>> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
>> +	 * prevent direct reclaim recursion back into the filesystem and blowing
>> +	 * stacks or deadlocking.
>> +	 */
>> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
>> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file systems, so
> shouldn't we fix this in common code?

It also has the drawback that it prevents __GFP_FS reclaim when ext4
is *not* at the bottom of the IO stack.

> We could also look into just using memalloc_nofs_save for the page
> cache allocation path instead of the per-mapping gfp_mask.

That makes more sense.

Cheers, Andreas

Damien Le Moal July 26, 2019, 2:24 a.m. UTC | #3

On 2019/07/25 20:54, Christoph Hellwig wrote:
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
>> +	gfp_t gfp_mask;
>> +
>>  	switch (ext4_inode_journal_mode(inode)) {
>>  	case EXT4_INODE_ORDERED_DATA_MODE:
>>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
>> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>>  		inode->i_mapping->a_ops = &ext4_da_aops;
>>  	else
>>  		inode->i_mapping->a_ops = &ext4_aops;
>> +
>> +	/*
>> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
>> +	 * prevent direct reclaim recursion back into the filesystem and blowing
>> +	 * stacks or deadlocking.
>> +	 */
>> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
>> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file systems, so
> shouldn't we fix this in common code?  We could also look into
> just using memalloc_nofs_save for the page cache allocation path
> instead of the per-mapping gfp_mask.

I agree. I did a quick scan and it looks to me like not all file systems are
using GFP_NOFS when grabbing pages for read or write. XFS, btrfs, f2fs, jfs,
gfs2 and ramfs seem OK, but I did not dig very deep for others where the use of
GFP_NOFS is not obvious.

I am not sure though how to approach a global fix. At the very least, it may be
good to have GFP_NOFS set by default in the inode mapping flags in
__address_space_init_once() which is called from inode_init_once(). But I am not
sure if that is enough nor if all file systems are using this function.

The other method as you suggest would be to add calls to
memalloc_nofs_save/restore() in functions like grab_cache_page_write_begin(),
but since that would cover writes only, we may want to do that at a higher level
in the various generic_xxx() and mpage_xxx() helper functions to cover more
ground. But that would still not be enough for the files systems not using these
helpers (plenty of examples for that for the write path).

Or do we go as high to VFS layer to add memalloc_nofs_save/restore() calls ?
That would be a big hammer fix... I personally think this would be OK though,
but I may be missing some points here.

Thoughts on the best approach ?

Best regards.

Dave Chinner July 26, 2019, 10:44 p.m. UTC | #4

On Thu, Jul 25, 2019 at 04:54:42AM -0700, Christoph Hellwig wrote:
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
> > +	gfp_t gfp_mask;
> > +
> >  	switch (ext4_inode_journal_mode(inode)) {
> >  	case EXT4_INODE_ORDERED_DATA_MODE:
> >  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> > @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
> >  		inode->i_mapping->a_ops = &ext4_da_aops;
> >  	else
> >  		inode->i_mapping->a_ops = &ext4_aops;
> > +
> > +	/*
> > +	 * Ensure all page cache allocations are done from GFP_NOFS context to
> > +	 * prevent direct reclaim recursion back into the filesystem and blowing
> > +	 * stacks or deadlocking.
> > +	 */
> > +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
> > +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file systems, so
> shouldn't we fix this in common code?  We could also look into
> just using memalloc_nofs_save for the page cache allocation path
> instead of the per-mapping gfp_mask.

I think it has to be the entire IO path - any allocation from the
underlying filesystem could recurse into the top level filesystem
and then deadlock if the memory reclaim submits IO or blocks on
IO completion from the upper filesystem. That's a bloody big hammer
for something that is only necessary when there are stacked
filesystems like this....

Cheers,

Dave.

Theodore Ts'o July 26, 2019, 10:55 p.m. UTC | #5

On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
> > 
> > This looks like something that could hit every file systems, so
> > shouldn't we fix this in common code?  We could also look into
> > just using memalloc_nofs_save for the page cache allocation path
> > instead of the per-mapping gfp_mask.
> 
> I think it has to be the entire IO path - any allocation from the
> underlying filesystem could recurse into the top level filesystem
> and then deadlock if the memory reclaim submits IO or blocks on
> IO completion from the upper filesystem. That's a bloody big hammer
> for something that is only necessary when there are stacked
> filesystems like this....

Yeah.... that's why using memalloc_nofs_save() probably makes the most
sense, and dm_zoned should use that before it calls into ext4.

       	   	    	       	    	   - Ted

Damien Le Moal July 27, 2019, 2:59 a.m. UTC | #6

On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>
>>> This looks like something that could hit every file systems, so
>>> shouldn't we fix this in common code?  We could also look into
>>> just using memalloc_nofs_save for the page cache allocation path
>>> instead of the per-mapping gfp_mask.
>>
>> I think it has to be the entire IO path - any allocation from the
>> underlying filesystem could recurse into the top level filesystem
>> and then deadlock if the memory reclaim submits IO or blocks on
>> IO completion from the upper filesystem. That's a bloody big hammer
>> for something that is only necessary when there are stacked
>> filesystems like this....
> 
> Yeah.... that's why using memalloc_nofs_save() probably makes the most
> sense, and dm_zoned should use that before it calls into ext4.

Unfortunately, with this particular setup, that will not solve the problem.
dm-zoned submit BIOs to its backend drive in response to XFS activity. The
requests for these BIOs are passed along to the kernel tcmu HBA and end up in
that HBA command ring. The commands themselves are read from the ring and
executed by the tcmu-runner user process which executes them doing
pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
context than the dm-zoned worker thread issuing the BIO,
memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

We tried a simpler setup using loopback mount (XFS used directly in an ext4
file) and running the same workload. We failed to recreate a similar deadlock in
this case, but I am strongly suspecting that it can happen too. It is simply
much harder to hit because the IO path from XFS to ext4 is all in-kernel and
asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs
which makes it relatively easy to get inter-dependent writes or read+write
queued back-to-back and create the deadlock.

So back to Dave's point, we may be needing the big-hammer solution in the case
of stacked file systems, while a non-stack setups do not necessarily need it
(that is for the FS to decide). But I do not see how to implement this big
hammer conditionally. How can a file system tell if it is at the top of the
stack (big hammer not needed) or lower than the top level (big hammer needed) ?

One simple hack would be an fcntl() or mount option to tell the FS to use
GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
applications or system setup is correct. So not so safe.

Dave Chinner July 28, 2019, 11:41 p.m. UTC | #7

On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> > On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
> >>>
> >>> This looks like something that could hit every file systems, so
> >>> shouldn't we fix this in common code?  We could also look into
> >>> just using memalloc_nofs_save for the page cache allocation path
> >>> instead of the per-mapping gfp_mask.
> >>
> >> I think it has to be the entire IO path - any allocation from the
> >> underlying filesystem could recurse into the top level filesystem
> >> and then deadlock if the memory reclaim submits IO or blocks on
> >> IO completion from the upper filesystem. That's a bloody big hammer
> >> for something that is only necessary when there are stacked
> >> filesystems like this....
> > 
> > Yeah.... that's why using memalloc_nofs_save() probably makes the most
> > sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process which executes them doing
> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
> context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

Right, I'm talking about using memalloc_nofs_save() as a huge hammer
in the pread/pwrite() calling context, not the bio submission
context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
below).

> So back to Dave's point, we may be needing the big-hammer solution in the case
> of stacked file systems, while a non-stack setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell if it is at the top of the
> stack (big hammer not needed) or lower than the top level (big hammer needed) ?

Age old question - tracking arbitrary nesting through mount/fs/bdev
layers is ... difficult. There is no easy way to tell.

> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
> applications or system setup is correct. So not so safe.

Wasn't there discussion at some point in the past about an interface
for special processes to be able to mark themselves as PF_MEMALLOC
(some kind of prctl, I think) for things like FUSE daemons? That
would prevent direct reclaim recursion for these userspace daemons
that are in the kernel memory reclaim IO path. It's the same
situation there, isn't it? How does fuse deal with this problem?

Cheers,

Dave,

Damien Le Moal July 29, 2019, 1:07 a.m. UTC | #8

On 2019/07/29 8:42, Dave Chinner wrote:
> On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this....
>>>
>>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> Right, I'm talking about using memalloc_nofs_save() as a huge hammer
> in the pread/pwrite() calling context, not the bio submission
> context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
> below).

Yes, I understood your point. And I agree that it indeed would be a big hammer.
We should be able to do better than that :)

>> One simple hack would be an fcntl() or mount option to tell the FS to use
>> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
>> applications or system setup is correct. So not so safe.
> 
> Wasn't there discussion at some point in the past about an interface
> for special processes to be able to mark themselves as PF_MEMALLOC
> (some kind of prctl, I think) for things like FUSE daemons? That
> would prevent direct reclaim recursion for these userspace daemons
> that are in the kernel memory reclaim IO path. It's the same
> situation there, isn't it? How does fuse deal with this problem?

I do not recall such discussion. But indeed FUSE may give some hints. Good idea.
Thanks. I will check.

Andreas Dilger July 29, 2019, 6:40 p.m. UTC | #9

On Jul 26, 2019, at 8:59 PM, Damien Le Moal <damien.lemoal@wdc.com> wrote:
> 
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>> 
>>>> This looks like something that could hit every file systems, so
>>>> shouldn't we fix this in common code?  We could also look into
>>>> just using memalloc_nofs_save for the page cache allocation path
>>>> instead of the per-mapping gfp_mask.
>>> 
>>> I think it has to be the entire IO path - any allocation from the
>>> underlying filesystem could recurse into the top level filesystem
>>> and then deadlock if the memory reclaim submits IO or blocks on
>>> IO completion from the upper filesystem. That's a bloody big hammer
>>> for something that is only necessary when there are stacked
>>> filesystems like this....
>> 
>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>> sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process which executes them doing
> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
> context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
in the BIO so that the worker thread will also get the correct behaviour
when it is processing this IO submission.

> We tried a simpler setup using loopback mount (XFS used directly in an ext4
> file) and running the same workload. We failed to recreate a similar deadlock in
> this case, but I am strongly suspecting that it can happen too. It is simply
> much harder to hit because the IO path from XFS to ext4 is all in-kernel and
> asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs
> which makes it relatively easy to get inter-dependent writes or read+write
> queued back-to-back and create the deadlock.
> 
> So back to Dave's point, we may be needing the big-hammer solution in the case
> of stacked file systems, while a non-stack setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell if it is at the top of the
> stack (big hammer not needed) or lower than the top level (big hammer needed) ?
> 
> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
> applications or system setup is correct. So not so safe.

Cheers, Andreas

Damien Le Moal July 30, 2019, 2:06 a.m. UTC | #10

Andreas,

On 2019/07/30 3:40, Andreas Dilger wrote:
> On Jul 26, 2019, at 8:59 PM, Damien Le Moal <damien.lemoal@wdc.com> wrote:
>>
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this....
>>>
>>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
> in the BIO so that the worker thread will also get the correct behaviour
> when it is processing this IO submission.

But there is no BIO to work with here: dm-zoned BIOs are transformed into an
application (tcmu-runner running the ZBC file handler) calls to pread()/pwrite()
into an ext4 file. And processing of these calls trigger the direct reclaim and
FS recursion which cause the deadlock. So BIOs are not the proper vehicle to
transport the information about the NOFS allocation.

If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the
cleanest approach.

Best regards.

Dave Chinner July 30, 2019, 11:47 p.m. UTC | #11

On Tue, Jul 30, 2019 at 02:06:33AM +0000, Damien Le Moal wrote:
> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the
> cleanest approach.

Clean, yes, but I'm not sure we want to expose kernel memory reclaim
capabilities to userspace... It would be misleading, too, because we
still want to allow reclaim to occur, just not have reclaim recurse
into other filesystems....

Cheers,

Dave.

Damien Le Moal July 31, 2019, 4:37 a.m. UTC | #12

Dave,

On 2019/07/31 8:48, Dave Chinner wrote:
> On Tue, Jul 30, 2019 at 02:06:33AM +0000, Damien Le Moal wrote:
>> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
>> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the
>> cleanest approach.
> 
> Clean, yes, but I'm not sure we want to expose kernel memory reclaim
> capabilities to userspace... It would be misleading, too, because we
> still want to allow reclaim to occur, just not have reclaim recurse
> into other filesystems....

When I wrote RWF_NORECLAIM, I was really thinking of RWF_NOFSRECLAIM. So
suppressing direct reclaim recursing into another FS rather than completely
disabling reclaim. Sorry for the confusion.

Would this be better ? This is still application controlled, so debatable if
control should be given. Most likely this would need to be limited to CAP_SYS
capable user processes (e.g. root processes).

I still need to check on FUSE if anything at all along these lines exists there.
I will dig.

Best regards.

ext4: Fix deadlock on page reclaim

Commit Message

Comments

Patch