
[BUG,REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs

Message ID 37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com
State Superseded

Commit Message

Zhihao Cheng July 12, 2024, 6:27 a.m. UTC
Hi. Recently, we found a deadlock in the inode reclaiming process, caused by
a circular dependency between a file inode and the file's xattr inode.

Problem description
===================

The inode reclaiming process (see prune_icache_sb()) first collects all
reclaimable inodes and marks them with the I_FREEING flag; at that point,
other processes get stuck if they try to grab these inodes (see
find_inode_fast()). The reclaiming process then destroys the inodes via
dispose_list().
Some filesystems (e.g. ext4 with the ea_inode feature, or ubifs with
xattrs) may do an inode lookup in their inode evicting callback; if that
lookup runs under the inode LRU traversing context, a deadlock can
happen.
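
For reference, the two-phase structure of the reclaim path looks roughly
like this (a condensed sketch of fs/inode.c with details and locking
elided, not verbatim kernel code):

long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
{
        LIST_HEAD(freeable);
        long freed;

        /*
         * Phase 1: walk the LRU. inode_lru_isolate() marks each victim
         * I_FREEING and moves it onto the local 'freeable' list. From
         * this point on, find_inode_fast() blocks on these inodes in
         * __wait_on_freeing_inode().
         */
        freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
                                     inode_lru_isolate, &freeable);

        /*
         * Phase 2: evict every collected inode. evict() ends with
         * wake_up_bit(&inode->i_state, __I_NEW), which is what releases
         * the waiters blocked in phase 1.
         */
        dispose_list(&freeable);
        return freed;
}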

Case 1: In ext4_evict_inode(), an ea inode lookup can happen if the
ea_inode feature is enabled, and the lookup gets stuck under the
evicting context like this:

  1. File A has inode i_reg and an ea inode i_ea
  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
  3. Then, the following two processes run like this:

     PA                              PB
  echo 2 > /proc/sys/vm/drop_caches
   shrink_slab
    prune_dcache_sb
    // i_reg is added into lru, lru->i_ea->i_reg
    prune_icache_sb
     list_lru_walk_one
      inode_lru_isolate
       i_ea->i_state |= I_FREEING // set inode state
      inode_lru_isolate
       __iget(i_reg)
       spin_unlock(&i_reg->i_lock)
       spin_unlock(lru_lock)
                                      rm file A
                                       i_reg->nlink = 0
       iput(i_reg) // i_reg->nlink is 0, do evict
        ext4_evict_inode
         ext4_xattr_delete_inode
          ext4_xattr_inode_dec_ref_all
           ext4_xattr_inode_iget
            ext4_iget(i_ea->i_ino)
             iget_locked
              find_inode_fast
               __wait_on_freeing_inode(i_ea) ----→ AA deadlock
     dispose_list // cannot be executed by prune_icache_sb
      wake_up_bit(&i_ea->i_state)
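
The wait that never ends here is __wait_on_freeing_inode(); condensed
from fs/inode.c (details vary by kernel version):

static void __wait_on_freeing_inode(struct inode *inode)
{
        wait_queue_head_t *wq;
        DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);

        wq = bit_waitqueue(&inode->i_state, __I_NEW);
        prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
        spin_unlock(&inode->i_lock);
        spin_unlock(&inode_hash_lock);
        schedule();     /* sleeps until wake_up_bit(..., __I_NEW) */
        finish_wait(wq, &wait.wq_entry);
        spin_lock(&inode_hash_lock);
}

The only waker is the wake_up_bit() done by evict(), i.e. by the
dispose_list() call in PA itself, hence the AA deadlock.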

Case 2: In the deleted-inode writing function ubifs_jnl_write_inode(),
the file deleting process holds BASEHD's wbuf->io_mutex while getting
the xattr inode, which can race with the inode reclaiming process (the
reclaiming process may try to lock BASEHD's wbuf->io_mutex in the inode
evicting function), so an ABBA deadlock can happen as follows:

  1. File A has inode ia and an xattr (with inode ixa), regular file B
     has inode ib and an xattr.
  2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
  3. Then, the following three processes run like this:

         PA                PB                        PC
                 echo 2 > /proc/sys/vm/drop_caches
                  shrink_slab
           prune_dcache_sb
           // ib and ia are added into lru, lru->ixa->ib->ia
                   prune_icache_sb
                    list_lru_walk_one
                     inode_lru_isolate
                      ixa->i_state |= I_FREEING // set inode state
                     inode_lru_isolate
                      __iget(ib)
                      spin_unlock(&ib->i_lock)
                      spin_unlock(lru_lock)
                                                    rm file B
                                                     ib->nlink = 0
  rm file A
   iput(ia)
    ubifs_evict_inode(ia)
     ubifs_jnl_delete_inode(ia)
      ubifs_jnl_write_inode(ia)
       make_reservation(BASEHD) // Lock wbuf->io_mutex
       ubifs_iget(ixa->i_ino)
        iget_locked
         find_inode_fast
          __wait_on_freeing_inode(ixa)
           |          iput(ib) // ib->nlink is 0, do evict
           |           ubifs_evict_inode
           |            ubifs_jnl_delete_inode(ib)
          ↓             ubifs_jnl_write_inode
      ABBA deadlock ←-----make_reservation(BASEHD)
                    dispose_list // cannot be executed by prune_icache_sb
                     wake_up_bit(&ixa->i_state)

Reproducer:
===========

https://bugzilla.kernel.org/show_bug.cgi?id=219022
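
The gist of the trigger, in userspace terms, is roughly the following
(a hypothetical illustration only: the race window is narrow and winning
it reliably needs the timing control of the bugzilla reproducer; the
mount point, xattr name and value size here are made up):

#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>

int main(void)
{
        static char buf[4096];
        int fd, dc;

        /* Steps 1-2: create a file whose xattr value is large enough to
         * live in an ea inode, read it back, then drop all references so
         * the ea inode ends up unreferenced on the inode LRU. */
        fd = open("/mnt/test/file_a", O_CREAT | O_RDWR, 0644);
        if (fd < 0 || fsetxattr(fd, "user.a", buf, sizeof(buf), 0) < 0)
                exit(1);
        fgetxattr(fd, "user.a", buf, sizeof(buf));
        close(fd);

        /* Step 3: race "echo 2 > /proc/sys/vm/drop_caches"
         * (prune_icache_sb) against unlink() of the file, as in the
         * PA/PB diagrams above. */
        if (fork() == 0) {
                dc = open("/proc/sys/vm/drop_caches", O_WRONLY);
                if (dc >= 0)
                        write(dc, "2", 1);
                _exit(0);
        }
        unlink("/mnt/test/file_a");
        return 0;
}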

About solutions
===============

We have considered several solutions, but each of them introduces new
side effects.

Solution 1: Don't cache the xattr inode: make the drop_inode callback
return true for xattr inodes. This makes getting xattrs slower (a sketch
follows the timing results below).
Test code:
     struct timeval s, e;
     char buf[4098];
     int i;

     gettimeofday(&s, NULL);
     for (i = 0; i < 10000; ++i)
         if (getxattr("/root/temp/file_a", "user.a", buf, 4098) < 0)
             perror("getxattr");
     gettimeofday(&e, NULL);
     printf("cost %ld us\n",
            1000000 * (e.tv_sec - s.tv_sec) + e.tv_usec - s.tv_usec);
Result:
ext4:
cost 151068 us // without fix
cost 321594 us // with fix
ubifs:
cost 134125 us // without fix
cost 371023 us // with fix
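
For ext4, Solution 1 could look roughly like the following (a
hypothetical sketch, not an actual patch; it assumes ea inodes can be
recognized via the EXT4_INODE_EA_INODE flag):

static int ext4_drop_inode(struct inode *inode)
{
        int drop = generic_drop_inode(inode);

        if (!drop)
                drop = fscrypt_drop_inode(inode);

        /* Never cache ea inodes: evicting them on the last iput() keeps
         * them off the inode LRU entirely, at the cost of re-reading
         * them from disk on every xattr access (hence the slowdown
         * above). */
        if (!drop && ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
                drop = 1;

        trace_ext4_drop_inode(inode, drop);
        return drop;
}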

Solution 2: Don't put the xattr inode onto the LRU. This is implemented
by holding a reference on the xattr inode until the file inode is
evicted and, besides that, making the drop_inode callback return true
for xattr inodes. This pins the xattr inode in memory until the file
inode is evicted, and the file inode won't be evicted while it has page
caches, specifically:
inode_lru_isolate
  // The file inode won't be evicted here, so its xattr inode won't be
  // reclaimed either, which increases the memory overhead.
  if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
   __iget(inode);
   if (remove_inode_buffers(inode))
   ...
   iput(inode);
  }

Solution 3: Forbid inode evicting under the inode LRU traversing
context. Specifically, mark the inode with I_WILL_FREE instead of
grabbing a reference, to eliminate any chance of the inode being looked
up from inode_lru_isolate(). This solution is wrong, because the page
caches may still be alive after invalidate_mapping_pages(), so we cannot
destroy the file inode after clearing I_WILL_FREE.
          if (remove_inode_buffers(inode)) {
@@ -855,9 +856,9 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
                  __count_vm_events(PGINODESTEAL, reap);
              mm_account_reclaimed_pages(reap);
          }
-        iput(inode);
+        spin_lock(&inode->i_lock);
+        inode->i_state &= ~I_WILL_FREE;
          spin_lock(lru_lock);
-        return LRU_RETRY;
      }

Solution 4: Break the circular dependency between the file inode and the
file's xattr inode. For example, after comparing
inode_lru_isolate()/prune_icache_sb() with invalidate_inodes(), we found
that invalidate_mapping_pages() is not invoked by invalidate_inodes().
Could we directly remove the branch 'if (inode_has_buffers(inode) ||
!mapping_empty(&inode->i_data))' from inode_lru_isolate()?

Any better solutions?

Comments

Theodore Ts'o July 12, 2024, 2:37 p.m. UTC | #1
On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
> Problem description
> ===================
> 
> The inode reclaiming process (see prune_icache_sb()) first collects all
> reclaimable inodes and marks them with the I_FREEING flag; at that point,
> other processes get stuck if they try to grab these inodes (see
> find_inode_fast()). The reclaiming process then destroys the inodes via
> dispose_list().
> Some filesystems (e.g. ext4 with the ea_inode feature, or ubifs with
> xattrs) may do an inode lookup in their inode evicting callback; if that
> lookup runs under the inode LRU traversing context, a deadlock can
> happen.
> 
> Case 1: In ext4_evict_inode(), an ea inode lookup can happen if the
> ea_inode feature is enabled, and the lookup gets stuck under the
> evicting context like this:
> 
>  1. File A has inode i_reg and an ea inode i_ea
>  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>  3. Then, the following two processes run like this:
> 
>     PA                              PB
>  echo 2 > /proc/sys/vm/drop_caches
>   shrink_slab
>    prune_dcache_sb
>    // i_reg is added into lru, lru->i_ea->i_reg
>    prune_icache_sb
>     list_lru_walk_one
>      inode_lru_isolate
>       i_ea->i_state |= I_FREEING // set inode state

Um, I don't see how this can happen.  If the ea_inode is in use,
i_count will be greater than zero, and hence the inode will never
go down the rest of the path in inode_lru_isolate():

	if (atomic_read(&inode->i_count) ||
	    ...) {
		list_lru_isolate(lru, &inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

Do you have an actual reproducer which triggers this?  Or would this
happen to be by any chance something that was dreamed up with DEPT?

						- Ted
Zhihao Cheng July 13, 2024, 2:29 a.m. UTC | #2
On 2024/7/12 22:37, Theodore Ts'o wrote:
>> Problem description
>> ===================
>>
>> The inode reclaiming process (see prune_icache_sb()) first collects all
>> reclaimable inodes and marks them with the I_FREEING flag; at that point,
>> other processes get stuck if they try to grab these inodes (see
>> find_inode_fast()). The reclaiming process then destroys the inodes via
>> dispose_list().
>> Some filesystems (e.g. ext4 with the ea_inode feature, or ubifs with
>> xattrs) may do an inode lookup in their inode evicting callback; if that
>> lookup runs under the inode LRU traversing context, a deadlock can
>> happen.
>>
>> Case 1: In ext4_evict_inode(), an ea inode lookup can happen if the
>> ea_inode feature is enabled, and the lookup gets stuck under the
>> evicting context like this:
>>
>>   1. File A has inode i_reg and an ea inode i_ea
>>   2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>>   3. Then, the following two processes run like this:
>>
>>      PA                              PB
>>   echo 2 > /proc/sys/vm/drop_caches
>>    shrink_slab
>>     prune_dcache_sb
>>     // i_reg is added into lru, lru->i_ea->i_reg
>>     prune_icache_sb
>>      list_lru_walk_one
>>       inode_lru_isolate
>>        i_ea->i_state |= I_FREEING // set inode state
> Um, I don't see how this can happen.  If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():

The refcount of the ea_inode can drop to zero before the file inode's,
according to the following process:
path_getxattr
  user_path_at(&path) // get file dentry and file inode
  getxattr
   ext4_xattr_get
    ext4_xattr_ibody_get
     ext4_xattr_inode_get
      ext4_xattr_inode_iget(&ea_inode)  // ea_inode->i_count = 1
      iput(ea_inode) // ea_inode->i_count = 0, put it into lru
  path_put(&path); // put file dentry and file inode

> 
> 	if (atomic_read(&inode->i_count) ||
> 	    ...) {
> 		list_lru_isolate(lru, &inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}
> 
> Do you have an actual reproducer which triggers this?  Or would this
> happen to be by any chance something that was dreamed up with DEPT?

The reproducer is in the second half of the article, along with some of
the solutions we tried.

Reproducer:
===========

https://bugzilla.kernel.org/show_bug.cgi?id=219022

About solutions
===============

[...]
Ryder Wang July 18, 2024, 3:04 a.m. UTC | #3
> Um, I don't see how this can happen.  If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
> 
>         if (atomic_read(&inode->i_count) ||
>             ...) {
>                 list_lru_isolate(lru, &inode->i_lru);
>                 spin_unlock(&inode->i_lock);
>                 this_cpu_dec(nr_unused);
>                 return LRU_REMOVED;
>         }

Yes, in inode_lru_isolate() (the cache-clearing case), there is an
inode->i_state check to avoid double-removing an inode that is being
removed by another process. Unluckily, there is no similar
inode->i_state check in iput_final() (the file-removal case), so
double-removing an inode can still happen.

It looks like we need to add an inode->i_state check in iput_final()
if we want to fix this race condition bug.
Zhihao Cheng July 18, 2024, 7:30 a.m. UTC | #4
On 2024/7/18 11:04, Ryder Wang wrote:
Hi, Ryder
>> Um, I don't see how this can happen.  If the ea_inode is in use,
>> i_count will be greater than zero, and hence the inode will never
>> go down the rest of the path in inode_lru_isolate():
>>
>>          if (atomic_read(&inode->i_count) ||
>>              ...) {
>>                  list_lru_isolate(lru, &inode->i_lru);
>>                  spin_unlock(&inode->i_lock);
>>                  this_cpu_dec(nr_unused);
>>                  return LRU_REMOVED;
>>          }
> 
> Yes, in inode_lru_isolate() (the cache-clearing case), there is an
> inode->i_state check to avoid double-removing an inode that is being
> removed by another process. Unluckily, there is no similar
> inode->i_state check in iput_final() (the file-removal case), so
> double-removing an inode can still happen.

I'm a little confused about the inode double-removing process. Can you
provide a detailed case showing how double-removing happens? I can't
see the relationship between inode double-removing and the problem I
described.
> 
> It looks like we need to add an inode->i_state check in iput_final()
> if we want to fix this race condition bug.
Jan Kara July 18, 2024, 1:40 p.m. UTC | #5
On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
> On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
> > Problem description
> > ===================
> > 
> > The inode reclaiming process (see prune_icache_sb()) first collects all
> > reclaimable inodes and marks them with the I_FREEING flag; at that point,
> > other processes get stuck if they try to grab these inodes (see
> > find_inode_fast()). The reclaiming process then destroys the inodes via
> > dispose_list().
> > Some filesystems (e.g. ext4 with the ea_inode feature, or ubifs with
> > xattrs) may do an inode lookup in their inode evicting callback; if that
> > lookup runs under the inode LRU traversing context, a deadlock can
> > happen.
> > 
> > Case 1: In ext4_evict_inode(), an ea inode lookup can happen if the
> > ea_inode feature is enabled, and the lookup gets stuck under the
> > evicting context like this:
> > 
> >  1. File A has inode i_reg and an ea inode i_ea
> >  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
> >  3. Then, the following two processes run like this:
> > 
> >     PA                              PB
> >  echo 2 > /proc/sys/vm/drop_caches
> >   shrink_slab
> >    prune_dcache_sb
> >    // i_reg is added into lru, lru->i_ea->i_reg
> >    prune_icache_sb
> >     list_lru_walk_one
> >      inode_lru_isolate
> >       i_ea->i_state |= I_FREEING // set inode state
> 
> Um, I don't see how this can happen.  If the ea_inode is in use,
> i_count will be greater than zero, and hence the inode will never
> go down the rest of the path in inode_lru_isolate():
> 
> 	if (atomic_read(&inode->i_count) ||
> 	    ...) {
> 		list_lru_isolate(lru, &inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}
> 
> Do you have an actual reproducer which triggers this?  Or would this
> happen to be by any chance something that was dreamed up with DEPT?

No, it looks like a real problem and I agree with the analysis. We don't
hold an ea_inode reference (i.e., ea_inode->i_count) from a normal inode.
The normal inode just owns that special on-disk xattr reference. Standard
inode references are acquired and dropped as needed.

And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
evict() needs to lookup the ea_inode and iget() it. So if we are processing
a list of inodes to dispose, all inodes have I_FREEING bit already set and
if ea_inode and its parent normal inode are both in the list, then the
evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.

Normally we don't hit this path because LRU list walk is not handling
inodes with 0 link count. But a race with unlink can make that happen with
iput() from inode_lru_isolate().

I'm pondering the best way to fix this. Maybe we could handle the
need for inode pinning in inode_lru_isolate() in a similar way as in the
writeback code, so that the last iput() cannot happen from
inode_lru_isolate(). In writeback we use the I_SYNC flag to pin the
inode and evict() waits for this flag to clear. I'll sleep on it, and if
I don't find it too disgusting tomorrow, I can code it.
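
For reference, the writeback pinning works roughly like this (condensed
from fs/fs-writeback.c and fs/inode.c, not verbatim):

	/* writeback side: pin the inode without taking a reference */
	spin_lock(&inode->i_lock);
	inode->i_state |= I_SYNC;
	spin_unlock(&inode->i_lock);
	/* ... write the inode and its pages out ... */
	spin_lock(&inode->i_lock);
	inode->i_state &= ~I_SYNC;
	wake_up_bit(&inode->i_state, __I_SYNC);
	spin_unlock(&inode->i_lock);

	/* evict side: evict() sleeps until I_SYNC clears */
	inode_wait_for_writeback(inode);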

								Honza
Zhihao Cheng July 19, 2024, 3:21 a.m. UTC | #6
Hi, Jan

On 2024/7/18 21:40, Jan Kara wrote:
> On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
>> On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
>>> Problem description
>>> ===================
>>>
>>> The inode reclaiming process (see prune_icache_sb()) first collects all
>>> reclaimable inodes and marks them with the I_FREEING flag; at that point,
>>> other processes get stuck if they try to grab these inodes (see
>>> find_inode_fast()). The reclaiming process then destroys the inodes via
>>> dispose_list().
>>> Some filesystems (e.g. ext4 with the ea_inode feature, or ubifs with
>>> xattrs) may do an inode lookup in their inode evicting callback; if that
>>> lookup runs under the inode LRU traversing context, a deadlock can
>>> happen.
>>>
>>> Case 1: In ext4_evict_inode(), an ea inode lookup can happen if the
>>> ea_inode feature is enabled, and the lookup gets stuck under the
>>> evicting context like this:
>>>
>>>   1. File A has inode i_reg and an ea inode i_ea
>>>   2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>>>   3. Then, the following two processes run like this:
>>>
>>>      PA                              PB
>>>   echo 2 > /proc/sys/vm/drop_caches
>>>    shrink_slab
>>>     prune_dcache_sb
>>>     // i_reg is added into lru, lru->i_ea->i_reg
>>>     prune_icache_sb
>>>      list_lru_walk_one
>>>       inode_lru_isolate
>>>        i_ea->i_state |= I_FREEING // set inode state
>>
>> Um, I don't see how this can happen.  If the ea_inode is in use,
>> i_count will be greater than zero, and hence the inode will never
>> go down the rest of the path in inode_lru_isolate():
>>
>> 	if (atomic_read(&inode->i_count) ||
>> 	    ...) {
>> 		list_lru_isolate(lru, &inode->i_lru);
>> 		spin_unlock(&inode->i_lock);
>> 		this_cpu_dec(nr_unused);
>> 		return LRU_REMOVED;
>> 	}
>>
>> Do you have an actual reproducer which triggers this?  Or would this
>> happen to be by any chance something that was dreamed up with DEPT?
> 
> No, it looks like a real problem and I agree with the analysis. We don't
> hold an ea_inode reference (i.e., ea_inode->i_count) from a normal inode.
> The normal inode just owns that special on-disk xattr reference. Standard
> inode references are acquired and dropped as needed.
> 
> And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
> evict() needs to lookup the ea_inode and iget() it. So if we are processing
> a list of inodes to dispose, all inodes have I_FREEING bit already set and
> if ea_inode and its parent normal inode are both in the list, then the
> evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.

Yes, absolutely right.
> 
> Normally we don't hit this path because LRU list walk is not handling
> inodes with 0 link count. But a race with unlink can make that happen with
> iput() from inode_lru_isolate().

Another reason is that mapping_empty(&inode->i_data) is consistent with
mapping_shrinkable(&inode->i_data) in most cases (CONFIG_HIGHMEM is
disabled by default on 64-bit platforms, so mapping_shrinkable() hardly
ever returns true if the file inode's mapping has page caches); the
problem path requires mapping_shrinkable() to return true while
mapping_empty() returns false.
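
For reference, mapping_shrinkable() looks roughly like this (condensed
from the kernel sources; the exact form varies by version):

static inline bool mapping_shrinkable(struct address_space *mapping)
{
	void *head;

	/* On highmem systems inodes are shrinkable regardless of page
	 * cache state: lowmem may be under pressure before highmem is. */
	if (IS_ENABLED(CONFIG_HIGHMEM))
		return true;

	/* Completely empty mapping? Shrink away. */
	head = rcu_access_pointer(mapping->i_pages.xa_head);
	if (!head)
		return true;

	/* A single shadow entry at offset 0 is stored directly in the
	 * head pointer; the inode shrinker must be able to reclaim it. */
	if (!xa_is_node(head) && xa_is_value(head))
		return true;

	return false;
}

So without CONFIG_HIGHMEM, seeing mapping_shrinkable() == true together
with mapping_empty() == false essentially means a lone shadow entry in
the mapping, which is why the window is so narrow.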

Do we have any other way to replace the following if-branch without
invoking __iget()?

         /*
          * On highmem systems, mapping_shrinkable() permits dropping
          * page cache in order to free up struct inodes: lowmem might
          * be under pressure before the cache inside the highmem zone.
          */
         if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
                 __iget(inode);
                 ...
                 iput(inode);
                 spin_lock(lru_lock);
                 return LRU_RETRY;
         }
> 
> I'm pondering the best way to fix this. Maybe we could handle the
> need for inode pinning in inode_lru_isolate() in a similar way as in the
> writeback code, so that the last iput() cannot happen from
> inode_lru_isolate(). In writeback we use the I_SYNC flag to pin the
> inode and evict() waits for this flag to clear. I'll sleep on it, and if
> I don't find it too disgusting tomorrow, I can code it.
> 

I guess you might modify it like this:
diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..5b1a9b23f53f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);

  static void __inode_add_lru(struct inode *inode, bool rotate)
  {
-       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PINING))
                 return;
         if (atomic_read(&inode->i_count))
                 return;
@@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
          * be under pressure before the cache inside the highmem zone.
          */
         if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-               __iget(inode);
+               inode->i_state |= I_PINING;
                 spin_unlock(&inode->i_lock);
                 spin_unlock(lru_lock);
                 if (remove_inode_buffers(inode)) {
@@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
                                 __count_vm_events(PGINODESTEAL, reap);
                         mm_account_reclaimed_pages(reap);
                 }
-               iput(inode);
+               spin_lock(&inode->i_lock);
+               inode->i_state &= ~I_PINING;
+               wake_up_bit(&inode->i_state, __I_PINING);
+               spin_unlock(&inode->i_lock);
                 spin_lock(lru_lock);
                 return LRU_RETRY;
         }
@@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
                 return;
         }

+       inode_wait_for_pining(inode);
         state = inode->i_state;
         if (!drop) {
                 WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..daf094fff5fe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  #define I_DONTCACHE            (1 << 16)
  #define I_SYNC_QUEUED          (1 << 17)
  #define I_PINNING_NETFS_WB     (1 << 18)
+#define __I_PINING             19
+#define I_PINING               (1 << __I_PINING)

  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)

In other words, this introduces a new inode state to solve the problem.
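
The diff above calls inode_wait_for_pining() without defining it.
Modeled on __inode_wait_for_writeback() in fs/fs-writeback.c, it could
look like this (a hypothetical sketch; iput_final() runs with i_lock
held, so the helper drops and retakes it):

static void inode_wait_for_pining(struct inode *inode)
{
	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_PINING);
	wait_queue_head_t *wqh;

	wqh = bit_waitqueue(&inode->i_state, __I_PINING);
	while (inode->i_state & I_PINING) {
		spin_unlock(&inode->i_lock);
		__wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
		spin_lock(&inode->i_lock);
	}
}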
Mateusz Guzik July 20, 2024, 10:42 a.m. UTC | #7
On Fri, Jul 19, 2024 at 11:21:51AM +0800, Zhihao Cheng wrote:
> On 2024/7/18 21:40, Jan Kara wrote:
> > I'm pondering the best way to fix this. Maybe we could handle the
> > need for inode pinning in inode_lru_isolate() in a similar way as in the
> > writeback code, so that the last iput() cannot happen from
> > inode_lru_isolate(). In writeback we use the I_SYNC flag to pin the
> > inode and evict() waits for this flag to clear. I'll sleep on it, and if
> > I don't find it too disgusting tomorrow, I can code it.
> > 
> 
> I guess you might modify it like this:
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..5b1a9b23f53f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);
> 
>  static void __inode_add_lru(struct inode *inode, bool rotate)
>  {
> -       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
> +       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PINING))
>                 return;
>         if (atomic_read(&inode->i_count))
>                 return;
> @@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>          * be under pressure before the cache inside the highmem zone.
>          */
>         if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> -               __iget(inode);
> +               inode->i_state |= I_PINING;
>                 spin_unlock(&inode->i_lock);
>                 spin_unlock(lru_lock);
>                 if (remove_inode_buffers(inode)) {
> @@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>                                 __count_vm_events(PGINODESTEAL, reap);
>                         mm_account_reclaimed_pages(reap);
>                 }
> -               iput(inode);
> +               spin_lock(&inode->i_lock);
> +               inode->i_state &= ~I_PINING;
> +               wake_up_bit(&inode->i_state, __I_PINING);
> +               spin_unlock(&inode->i_lock);
>                 spin_lock(lru_lock);
>                 return LRU_RETRY;
>         }
> @@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
>                 return;
>         }
> 
> +       inode_wait_for_pining(inode);
>         state = inode->i_state;
>         if (!drop) {
>                 WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..daf094fff5fe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>  #define I_DONTCACHE            (1 << 16)
>  #define I_SYNC_QUEUED          (1 << 17)
>  #define I_PINNING_NETFS_WB     (1 << 18)
> +#define __I_PINING             19
> +#define I_PINING               (1 << __I_PINING)
> 
>  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> 
> In other words, this introduces a new inode state to solve the problem.
> 

My non-maintainer $0,03 is as follows:

1. I_PINING is too generic a name. I_LRU_PINNED or something else
indicating what this is for would be prudent.
2. While not specific to this patch, the handling of i_state is too
accidental-breakage friendly. A full blown solution is way out of scope
here, but something can be done to future-proof this work anyway.

To that end I would suggest (a sketch follows the list):
1. inode_lru_pin() which apart from setting the flag includes:
	BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE))
2. inode_lru_unpin() which apart from unsetting the flag + wakeup includes:
	BUG_ON(!(inode->i_state & I_LRU_PINNED))
3. inode_lru_wait_for_pinned()
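
A sketch of what these helpers could look like (hypothetical, per the
naming and asserts suggested above; assumes __I_LRU_PINNED/I_LRU_PINNED
are defined the same way as __I_PINING/I_PINING in the diff):

static inline void inode_lru_pin(struct inode *inode)
{
	lockdep_assert_held(&inode->i_lock);
	BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE));
	inode->i_state |= I_LRU_PINNED;
}

static inline void inode_lru_unpin(struct inode *inode)
{
	lockdep_assert_held(&inode->i_lock);
	BUG_ON(!(inode->i_state & I_LRU_PINNED));
	inode->i_state &= ~I_LRU_PINNED;
	wake_up_bit(&inode->i_state, __I_LRU_PINNED);
}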

However, a non-cosmetic remark is that at the spot where
inode_wait_for_pining() gets invoked, none of the pinning-blocking flags
may be set (to my reading anyway). This is not the end of the world, but
it does mean the waiting routine will have to check things in a loop.

Names are not that important, the key is to keep the logic and
dependencies close by code-wise.
Zhihao Cheng Aug. 5, 2024, 1:29 a.m. UTC | #8
Hi, based on the ideas from Jan and Mateusz, I sent a fix patch, see 
https://lore.kernel.org/linux-fsdevel/20240805013446.814357-1-chengzhihao@huaweicloud.com/T/#u

On 2024/7/12 14:27, Zhihao Cheng wrote:
> Hi. Recently, we found a deadlock in the inode reclaiming process, caused by
> a circular dependency between a file inode and the file's xattr inode.
> 
> Problem description
> ===================
> 
> The inode reclaiming process (see prune_icache_sb()) first collects all
> reclaimable inodes and marks them with the I_FREEING flag; at that point,
> other processes get stuck if they try to grab these inodes (see
> find_inode_fast()). The reclaiming process then destroys the inodes via
> dispose_list().
> Some filesystems (e.g. ext4 with the ea_inode feature, or ubifs with
> xattrs) may do an inode lookup in their inode evicting callback; if that
> lookup runs under the inode LRU traversing context, a deadlock can
> happen.
>

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..c649be22f841 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -843,7 +844,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
       * be under pressure before the cache inside the highmem zone.
       */
      if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-        __iget(inode);
+        inode->i_state |= I_WILL_FREE;
          spin_unlock(&inode->i_lock);
          spin_unlock(lru_lock);