Message ID | 20190725093358.30679-1-damien.lemoal@wdc.com |
---|---|
State | New |
Headers | show |
Series | ext4: Fix deadlock on page reclaim | expand |
On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote: > + gfp_t gfp_mask; > + > switch (ext4_inode_journal_mode(inode)) { > case EXT4_INODE_ORDERED_DATA_MODE: > case EXT4_INODE_WRITEBACK_DATA_MODE: > @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode) > inode->i_mapping->a_ops = &ext4_da_aops; > else > inode->i_mapping->a_ops = &ext4_aops; > + > + /* > + * Ensure all page cache allocations are done from GFP_NOFS context to > + * prevent direct reclaim recursion back into the filesystem and blowing > + * stacks or deadlocking. > + */ > + gfp_mask = mapping_gfp_mask(inode->i_mapping); > + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); This looks like something that could hit every file systems, so shouldn't we fix this in common code? We could also look into just using memalloc_nofs_save for the page cache allocation path instead of the per-mapping gfp_mask. > } > > static int __ext4_block_zero_page_range(handle_t *handle, > -- > 2.21.0 > ---end quoted text---
On Jul 25, 2019, at 5:54 AM, Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote: >> + gfp_t gfp_mask; >> + >> switch (ext4_inode_journal_mode(inode)) { >> case EXT4_INODE_ORDERED_DATA_MODE: >> case EXT4_INODE_WRITEBACK_DATA_MODE: >> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode) >> inode->i_mapping->a_ops = &ext4_da_aops; >> else >> inode->i_mapping->a_ops = &ext4_aops; >> + >> + /* >> + * Ensure all page cache allocations are done from GFP_NOFS context to >> + * prevent direct reclaim recursion back into the filesystem and blowing >> + * stacks or deadlocking. >> + */ >> + gfp_mask = mapping_gfp_mask(inode->i_mapping); >> + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); > > This looks like something that could hit every file systems, so > shouldn't we fix this in common code? It also has the drawback that it prevents __GFP_FS reclaim when ext4 is *not* at the bottom of the IO stack. > We could also look into just using memalloc_nofs_save for the page > cache allocation path instead of the per-mapping gfp_mask. That makes more sense. Cheers, Andreas
On 2019/07/25 20:54, Christoph Hellwig wrote: > On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote: >> + gfp_t gfp_mask; >> + >> switch (ext4_inode_journal_mode(inode)) { >> case EXT4_INODE_ORDERED_DATA_MODE: >> case EXT4_INODE_WRITEBACK_DATA_MODE: >> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode) >> inode->i_mapping->a_ops = &ext4_da_aops; >> else >> inode->i_mapping->a_ops = &ext4_aops; >> + >> + /* >> + * Ensure all page cache allocations are done from GFP_NOFS context to >> + * prevent direct reclaim recursion back into the filesystem and blowing >> + * stacks or deadlocking. >> + */ >> + gfp_mask = mapping_gfp_mask(inode->i_mapping); >> + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); > > This looks like something that could hit every file systems, so > shouldn't we fix this in common code? We could also look into > just using memalloc_nofs_save for the page cache allocation path > instead of the per-mapping gfp_mask. I agree. I did a quick scan and it looks to me like not all file systems are using GFP_NOFS when grabbing pages for read or write. XFS, btrfs, f2fs, jfs, gfs2 and ramfs seem OK, but I did not dig very deep for others where the use of GFP_NOFS is not obvious. I am not sure though how to approach a global fix. At the very least, it may be good to have GFP_NOFS set by default in the inode mapping flags in __address_space_init_once() which is called from inode_init_once(). But I am not sure if that is enough nor if all file systems are using this function. The other method as you suggest would be to add calls to memalloc_nofs_save/restore() in functions like grab_cache_page_write_begin(), but since that would cover writes only, we may want to do that at a higher level in the various generic_xxx() and mpage_xxx() helper functions to cover more ground. But that would still not be enough for the files systems not using these helpers (plenty of examples for that for the write path). Or do we go as high to VFS layer to add memalloc_nofs_save/restore() calls ? That would be a big hammer fix... I personally think this would be OK though, but I may be missing some points here. Thoughts on the best approach ? Best regards.
On Thu, Jul 25, 2019 at 04:54:42AM -0700, Christoph Hellwig wrote: > On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote: > > + gfp_t gfp_mask; > > + > > switch (ext4_inode_journal_mode(inode)) { > > case EXT4_INODE_ORDERED_DATA_MODE: > > case EXT4_INODE_WRITEBACK_DATA_MODE: > > @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode) > > inode->i_mapping->a_ops = &ext4_da_aops; > > else > > inode->i_mapping->a_ops = &ext4_aops; > > + > > + /* > > + * Ensure all page cache allocations are done from GFP_NOFS context to > > + * prevent direct reclaim recursion back into the filesystem and blowing > > + * stacks or deadlocking. > > + */ > > + gfp_mask = mapping_gfp_mask(inode->i_mapping); > > + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); > > This looks like something that could hit every file systems, so > shouldn't we fix this in common code? We could also look into > just using memalloc_nofs_save for the page cache allocation path > instead of the per-mapping gfp_mask. I think it has to be the entire IO path - any allocation from the underlying filesystem could recurse into the top level filesystem and then deadlock if the memory reclaim submits IO or blocks on IO completion from the upper filesystem. That's a bloody big hammer for something that is only necessary when there are stacked filesystems like this.... Cheers, Dave.
On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote: > > > > This looks like something that could hit every file systems, so > > shouldn't we fix this in common code? We could also look into > > just using memalloc_nofs_save for the page cache allocation path > > instead of the per-mapping gfp_mask. > > I think it has to be the entire IO path - any allocation from the > underlying filesystem could recurse into the top level filesystem > and then deadlock if the memory reclaim submits IO or blocks on > IO completion from the upper filesystem. That's a bloody big hammer > for something that is only necessary when there are stacked > filesystems like this.... Yeah.... that's why using memalloc_nofs_save() probably makes the most sense, and dm_zoned should use that before it calls into ext4. - Ted
On 2019/07/27 7:55, Theodore Y. Ts'o wrote: > On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote: >>> >>> This looks like something that could hit every file systems, so >>> shouldn't we fix this in common code? We could also look into >>> just using memalloc_nofs_save for the page cache allocation path >>> instead of the per-mapping gfp_mask. >> >> I think it has to be the entire IO path - any allocation from the >> underlying filesystem could recurse into the top level filesystem >> and then deadlock if the memory reclaim submits IO or blocks on >> IO completion from the upper filesystem. That's a bloody big hammer >> for something that is only necessary when there are stacked >> filesystems like this.... > > Yeah.... that's why using memalloc_nofs_save() probably makes the most > sense, and dm_zoned should use that before it calls into ext4. Unfortunately, with this particular setup, that will not solve the problem. dm-zoned submit BIOs to its backend drive in response to XFS activity. The requests for these BIOs are passed along to the kernel tcmu HBA and end up in that HBA command ring. The commands themselves are read from the ring and executed by the tcmu-runner user process which executes them doing pread()/pwrite() to the ext4 file. The tcmu-runner process being a different context than the dm-zoned worker thread issuing the BIO, memalloc_nofs_save/restore() calls in dm-zoned will have no effect. We tried a simpler setup using loopback mount (XFS used directly in an ext4 file) and running the same workload. We failed to recreate a similar deadlock in this case, but I am strongly suspecting that it can happen too. It is simply much harder to hit because the IO path from XFS to ext4 is all in-kernel and asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs which makes it relatively easy to get inter-dependent writes or read+write queued back-to-back and create the deadlock. So back to Dave's point, we may be needing the big-hammer solution in the case of stacked file systems, while a non-stack setups do not necessarily need it (that is for the FS to decide). But I do not see how to implement this big hammer conditionally. How can a file system tell if it is at the top of the stack (big hammer not needed) or lower than the top level (big hammer needed) ? One simple hack would be an fcntl() or mount option to tell the FS to use GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the applications or system setup is correct. So not so safe.
On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote: > On 2019/07/27 7:55, Theodore Y. Ts'o wrote: > > On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote: > >>> > >>> This looks like something that could hit every file systems, so > >>> shouldn't we fix this in common code? We could also look into > >>> just using memalloc_nofs_save for the page cache allocation path > >>> instead of the per-mapping gfp_mask. > >> > >> I think it has to be the entire IO path - any allocation from the > >> underlying filesystem could recurse into the top level filesystem > >> and then deadlock if the memory reclaim submits IO or blocks on > >> IO completion from the upper filesystem. That's a bloody big hammer > >> for something that is only necessary when there are stacked > >> filesystems like this.... > > > > Yeah.... that's why using memalloc_nofs_save() probably makes the most > > sense, and dm_zoned should use that before it calls into ext4. > > Unfortunately, with this particular setup, that will not solve the problem. > dm-zoned submit BIOs to its backend drive in response to XFS activity. The > requests for these BIOs are passed along to the kernel tcmu HBA and end up in > that HBA command ring. The commands themselves are read from the ring and > executed by the tcmu-runner user process which executes them doing > pread()/pwrite() to the ext4 file. The tcmu-runner process being a different > context than the dm-zoned worker thread issuing the BIO, > memalloc_nofs_save/restore() calls in dm-zoned will have no effect. Right, I'm talking about using memalloc_nofs_save() as a huge hammer in the pread/pwrite() calling context, not the bio submission context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO below). > So back to Dave's point, we may be needing the big-hammer solution in the case > of stacked file systems, while a non-stack setups do not necessarily need it > (that is for the FS to decide). But I do not see how to implement this big > hammer conditionally. How can a file system tell if it is at the top of the > stack (big hammer not needed) or lower than the top level (big hammer needed) ? Age old question - tracking arbitrary nesting through mount/fs/bdev layers is ... difficult. There is no easy way to tell. > One simple hack would be an fcntl() or mount option to tell the FS to use > GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the > applications or system setup is correct. So not so safe. Wasn't there discussion at some point in the past about an interface for special processes to be able to mark themselves as PF_MEMALLOC (some kind of prctl, I think) for things like FUSE daemons? That would prevent direct reclaim recursion for these userspace daemons that are in the kernel memory reclaim IO path. It's the same situation there, isn't it? How does fuse deal with this problem? Cheers, Dave,
On 2019/07/29 8:42, Dave Chinner wrote: > On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote: >> On 2019/07/27 7:55, Theodore Y. Ts'o wrote: >>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote: >>>>> >>>>> This looks like something that could hit every file systems, so >>>>> shouldn't we fix this in common code? We could also look into >>>>> just using memalloc_nofs_save for the page cache allocation path >>>>> instead of the per-mapping gfp_mask. >>>> >>>> I think it has to be the entire IO path - any allocation from the >>>> underlying filesystem could recurse into the top level filesystem >>>> and then deadlock if the memory reclaim submits IO or blocks on >>>> IO completion from the upper filesystem. That's a bloody big hammer >>>> for something that is only necessary when there are stacked >>>> filesystems like this.... >>> >>> Yeah.... that's why using memalloc_nofs_save() probably makes the most >>> sense, and dm_zoned should use that before it calls into ext4. >> >> Unfortunately, with this particular setup, that will not solve the problem. >> dm-zoned submit BIOs to its backend drive in response to XFS activity. The >> requests for these BIOs are passed along to the kernel tcmu HBA and end up in >> that HBA command ring. The commands themselves are read from the ring and >> executed by the tcmu-runner user process which executes them doing >> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different >> context than the dm-zoned worker thread issuing the BIO, >> memalloc_nofs_save/restore() calls in dm-zoned will have no effect. > > Right, I'm talking about using memalloc_nofs_save() as a huge hammer > in the pread/pwrite() calling context, not the bio submission > context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO > below). Yes, I understood your point. And I agree that it indeed would be a big hammer. We should be able to do better than that :) >> One simple hack would be an fcntl() or mount option to tell the FS to use >> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the >> applications or system setup is correct. So not so safe. > > Wasn't there discussion at some point in the past about an interface > for special processes to be able to mark themselves as PF_MEMALLOC > (some kind of prctl, I think) for things like FUSE daemons? That > would prevent direct reclaim recursion for these userspace daemons > that are in the kernel memory reclaim IO path. It's the same > situation there, isn't it? How does fuse deal with this problem? I do not recall such discussion. But indeed FUSE may give some hints. Good idea. Thanks. I will check.
On Jul 26, 2019, at 8:59 PM, Damien Le Moal <damien.lemoal@wdc.com> wrote: > > On 2019/07/27 7:55, Theodore Y. Ts'o wrote: >> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote: >>>> >>>> This looks like something that could hit every file systems, so >>>> shouldn't we fix this in common code? We could also look into >>>> just using memalloc_nofs_save for the page cache allocation path >>>> instead of the per-mapping gfp_mask. >>> >>> I think it has to be the entire IO path - any allocation from the >>> underlying filesystem could recurse into the top level filesystem >>> and then deadlock if the memory reclaim submits IO or blocks on >>> IO completion from the upper filesystem. That's a bloody big hammer >>> for something that is only necessary when there are stacked >>> filesystems like this.... >> >> Yeah.... that's why using memalloc_nofs_save() probably makes the most >> sense, and dm_zoned should use that before it calls into ext4. > > Unfortunately, with this particular setup, that will not solve the problem. > dm-zoned submit BIOs to its backend drive in response to XFS activity. The > requests for these BIOs are passed along to the kernel tcmu HBA and end up in > that HBA command ring. The commands themselves are read from the ring and > executed by the tcmu-runner user process which executes them doing > pread()/pwrite() to the ext4 file. The tcmu-runner process being a different > context than the dm-zoned worker thread issuing the BIO, > memalloc_nofs_save/restore() calls in dm-zoned will have no effect. One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state in the BIO so that the worker thread will also get the correct behaviour when it is processing this IO submission. > We tried a simpler setup using loopback mount (XFS used directly in an ext4 > file) and running the same workload. We failed to recreate a similar deadlock in > this case, but I am strongly suspecting that it can happen too. It is simply > much harder to hit because the IO path from XFS to ext4 is all in-kernel and > asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs > which makes it relatively easy to get inter-dependent writes or read+write > queued back-to-back and create the deadlock. > > So back to Dave's point, we may be needing the big-hammer solution in the case > of stacked file systems, while a non-stack setups do not necessarily need it > (that is for the FS to decide). But I do not see how to implement this big > hammer conditionally. How can a file system tell if it is at the top of the > stack (big hammer not needed) or lower than the top level (big hammer needed) ? > > One simple hack would be an fcntl() or mount option to tell the FS to use > GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the > applications or system setup is correct. So not so safe. Cheers, Andreas
Andreas, On 2019/07/30 3:40, Andreas Dilger wrote: > On Jul 26, 2019, at 8:59 PM, Damien Le Moal <damien.lemoal@wdc.com> wrote: >> >> On 2019/07/27 7:55, Theodore Y. Ts'o wrote: >>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote: >>>>> >>>>> This looks like something that could hit every file systems, so >>>>> shouldn't we fix this in common code? We could also look into >>>>> just using memalloc_nofs_save for the page cache allocation path >>>>> instead of the per-mapping gfp_mask. >>>> >>>> I think it has to be the entire IO path - any allocation from the >>>> underlying filesystem could recurse into the top level filesystem >>>> and then deadlock if the memory reclaim submits IO or blocks on >>>> IO completion from the upper filesystem. That's a bloody big hammer >>>> for something that is only necessary when there are stacked >>>> filesystems like this.... >>> >>> Yeah.... that's why using memalloc_nofs_save() probably makes the most >>> sense, and dm_zoned should use that before it calls into ext4. >> >> Unfortunately, with this particular setup, that will not solve the problem. >> dm-zoned submit BIOs to its backend drive in response to XFS activity. The >> requests for these BIOs are passed along to the kernel tcmu HBA and end up in >> that HBA command ring. The commands themselves are read from the ring and >> executed by the tcmu-runner user process which executes them doing >> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different >> context than the dm-zoned worker thread issuing the BIO, >> memalloc_nofs_save/restore() calls in dm-zoned will have no effect. > > One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state > in the BIO so that the worker thread will also get the correct behaviour > when it is processing this IO submission. But there is no BIO to work with here: dm-zoned BIOs are transformed into an application (tcmu-runner running the ZBC file handler) calls to pread()/pwrite() into an ext4 file. And processing of these calls trigger the direct reclaim and FS recursion which cause the deadlock. So BIOs are not the proper vehicle to transport the information about the NOFS allocation. If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the cleanest approach. Best regards.
On Tue, Jul 30, 2019 at 02:06:33AM +0000, Damien Le Moal wrote: > If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a > RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the > cleanest approach. Clean, yes, but I'm not sure we want to expose kernel memory reclaim capabilities to userspace... It would be misleading, too, because we still want to allow reclaim to occur, just not have reclaim recurse into other filesystems.... Cheers, Dave.
Dave, On 2019/07/31 8:48, Dave Chinner wrote: > On Tue, Jul 30, 2019 at 02:06:33AM +0000, Damien Le Moal wrote: >> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a >> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the >> cleanest approach. > > Clean, yes, but I'm not sure we want to expose kernel memory reclaim > capabilities to userspace... It would be misleading, too, because we > still want to allow reclaim to occur, just not have reclaim recurse > into other filesystems.... When I wrote RWF_NORECLAIM, I was really thinking of RWF_NOFSRECLAIM. So suppressing direct reclaim recursing into another FS rather than completely disabling reclaim. Sorry for the confusion. Would this be better ? This is still application controlled, so debatable if control should be given. Most likely this would need to be limited to CAP_SYS capable user processes (e.g. root processes). I still need to check on FUSE if anything at all along these lines exists there. I will dig. Best regards.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 420fe3deed39..f882929037df 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1292,8 +1292,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, * grab_cache_page_write_begin() can take a long time if the * system is thrashing due to memory pressure, or if the page * is being written back. So grab it first before we start - * the transaction handle. This also allows us to allocate - * the page (if needed) without using GFP_NOFS. + * the transaction handle. */ retry_grab: page = grab_cache_page_write_begin(mapping, index, flags); @@ -3084,8 +3083,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping, * grab_cache_page_write_begin() can take a long time if the * system is thrashing due to memory pressure, or if the page * is being written back. So grab it first before we start - * the transaction handle. This also allows us to allocate - * the page (if needed) without using GFP_NOFS. + * the transaction handle. */ retry_grab: page = grab_cache_page_write_begin(mapping, index, flags); @@ -4003,6 +4001,8 @@ static const struct address_space_operations ext4_dax_aops = { void ext4_set_aops(struct inode *inode) { + gfp_t gfp_mask; + switch (ext4_inode_journal_mode(inode)) { case EXT4_INODE_ORDERED_DATA_MODE: case EXT4_INODE_WRITEBACK_DATA_MODE: @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode) inode->i_mapping->a_ops = &ext4_da_aops; else inode->i_mapping->a_ops = &ext4_aops; + + /* + * Ensure all page cache allocations are done from GFP_NOFS context to + * prevent direct reclaim recursion back into the filesystem and blowing + * stacks or deadlocking. + */ + gfp_mask = mapping_gfp_mask(inode->i_mapping); + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); } static int __ext4_block_zero_page_range(handle_t *handle,
In ext4_[da_]write_begin(), grab_cache_page_write_begin() is being called without GFP_NOFS set for the context. This is considered adequate as any eventual memory reclaim triggered by a page allocation is being done before the transaction handle for the write operation is started. However, with the following setup: * fo creating and writing files on XFS * XFS file system on top of dm-zoned target device * dm-zoned target created on tcmu-runner emulated ZBC disk * emulated ZBC disk backend file on ext4 * ext4 file system on regular SATA SSD A deadlock was observed under the heavy file write workload generated using fio. The deadlock is clearly apparent from the backtrace of the tcmu-runner handler task backtrace: tcmu-runner call Trace: wait_for_completion+0x12c/0x170 <-- deadlock (see below text) xfs_buf_iowait+0x39/0x1c0 [xfs] __xfs_buf_submit+0x118/0x2e0 [xfs] <-- XFS issues writes to dm-zoned device, which in turn will issue writes to the emulated ZBC device handled by tcmu-runner. xfs_bwrite+0x25/0x60 [xfs] xfs_reclaim_inode+0x303/0x330 [xfs] xfs_reclaim_inodes_ag+0x223/0x450 [xfs] xfs_reclaim_inodes_nr+0x31/0x40 [xfs] <-- Absence of GFP_NOFS allows reclaim to call in XFS reclaim for the XFS file system on the emulated device super_cache_scan+0x153/0x1a0 do_shrink_slab+0x17b/0x3c0 shrink_slab+0x170/0x2c0 shrink_node+0x1d6/0x4a0 do_try_to_free_pages+0xdb/0x3c0 try_to_free_pages+0x112/0x2e0 <-- Page reclaim triggers __alloc_pages_slowpath+0x422/0x1020 __alloc_pages_nodemask+0x37f/0x400 pagecache_get_page+0xb4/0x390 grab_cache_page_write_begin+0x1d/0x40 ext4_da_write_begin+0xd6/0x530 generic_perform_write+0xc2/0x1e0 __generic_file_write_iter+0xf9/0x1d0 ext4_file_write_iter+0xc6/0x3b0 new_sync_write+0x12d/0x1d0 vfs_write+0xdb/0x1d0 ksys_pwrite64+0x65/0xa0 <-- tcmu-runner ZBC handler writes to the backend file in response to dm-zoned target write IO When XFS reclaim issues write IOs to the dm-zoned target device, dm-zoned issue write IOs to the tcmu-runner emulated device. However, the tcmu ZBC handler is singled threaded and blocked waiting for the completion of the started pwrite() call and so does not process the newly issued write IOs necessary to complete page reclaim. The system is in a deadlocked state. This problem is 100% reproducible, the deadlock happening after fio running for a few minutes. A similar deadlock was also observed for read operations into the ext4 backend file on page cache miss trigering a page allocation and page reclaim. Switching the tcmu emulated ZBC disk backend file from an ext4 file to an XFS file, none of these deadlocks are observed. Fix this problem by removing __GFP_FS from ext4 inode mapping gfp_mask. The code used for this fix is borrowed from XFS xfs_setup_inode(). The inode mapping gfp_mask initialization is added to ext4_set_aops(). Reported-by: Masato Suzuki <masato.suzuki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> --- fs/ext4/inode.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)