
possible circular locking dependency detected

Message ID CA+55aFwntZpaAutp_C=sLBg8ravfYGRoF4TTG1cfE3SjaJEOtw@mail.gmail.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Linus Torvalds Sept. 1, 2016, 10:04 p.m. UTC
On Thu, Sep 1, 2016 at 2:43 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Sep 1, 2016 at 2:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Outside as in "all fs activity in bind happens under it".  Along with
>> assignment to ->u.addr, etc.  IOW, make it the outermost lock there.
>
> Hah, yes. I misunderstood you.
>
> Yes. In fact that fixes the problem I mentioned, rather than introducing it.

So the easiest approach would seem to be to revert commit c845acb324aa
("af_unix: Fix splice-bind deadlock"), and then apply the lock split.

Like the attached two patches.

This is still *entirely* untested.

Rainer?

                 Linus

Comments

CAI Qian Sept. 2, 2016, 2:43 p.m. UTC | #1
----- Original Message -----
> From: "Linus Torvalds" <torvalds@linux-foundation.org>
> To: "Al Viro" <viro@zeniv.linux.org.uk>, "CAI Qian" <caiqian@redhat.com>
> Cc: "Miklos Szeredi" <miklos@szeredi.hu>, "Rainer Weikusat" <rweikusat@cyberadapt.com>, "Hannes Frederic Sowa"
> <hannes@stressinduktion.org>, "Rainer Weikusat" <rweikusat@mobileactivedefense.com>, "Eric Sandeen"
> <esandeen@redhat.com>, "Network Development" <netdev@vger.kernel.org>
> Sent: Thursday, September 1, 2016 6:04:38 PM
> Subject: Re: possible circular locking dependency detected
> 
> On Thu, Sep 1, 2016 at 2:43 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Thu, Sep 1, 2016 at 2:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >> Outside as in "all fs activity in bind happens under it".  Along with
> >> assignment to ->u.addr, etc.  IOW, make it the outermost lock there.
> >
> > Hah, yes. I misunderstood you.
> >
> > Yes. In fact that fixes the problem I mentioned, rather than introducing
> > it.
> 
> So the easiest approach would seem to be to revert commit c845acb324aa
> ("af_unix: Fix splice-bind deadlock"), and then apply the lock split.
> 
> Like the attached two patches.
> 
> This is still *entirely* untested.
Tested-by: CAI Qian <caiqian@redhat.com>
> 
> Rainer?
> 
>                  Linus
>
Rainer Weikusat Sept. 2, 2016, 3:18 p.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, Sep 1, 2016 at 2:43 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Sep 1, 2016 at 2:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>> Outside as in "all fs activity in bind happens under it".  Along with
>>> assignment to ->u.addr, etc.  IOW, make it the outermost lock there.
>>
>> Hah, yes. I misunderstood you.
>>
>> Yes. In fact that fixes the problem I mentioned, rather than introducing it.
>
> So the easiest approach would seem to be to revert commit c845acb324aa
> ("af_unix: Fix splice-bind deadlock"), and then apply the lock split.
>
> Like the attached two patches.
>
> This is still *entirely* untested.

As far as I can tell, this should work, as I can't currently imagine why
a fs operation might end up binding a unix socket. That said, the idea
of making af_unix.c yet more complicated in order to work around the
irregular behaviour of (as far as I can tell) a single filesystem (for
which kern_path_create doesn't really mean kern_path_create and which
has to work around that once it gets control) goes against all
instincts I have in this area. If filesystems need to do arbitrary
stuff when __sb_start_write is called for 'their' superblock, they
should be able to do so directly.

At present, this is a theoretical concern, as I can't (due to other work
commitments) put any non-cursory work into this before Sunday. There
may also be other reasons why this idea is impractical or even
unworkable.
CAI Qian Sept. 2, 2016, 3:51 p.m. UTC | #3
----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Linus Torvalds" <torvalds@linux-foundation.org>
> Sent: Friday, September 2, 2016 10:43:20 AM
> Subject: Re: possible circular locking dependency detected
> 
> > This is still *entirely* untested.
> Tested-by: CAI Qian <caiqian@redhat.com>
Actually, I take that back: splice now seems to start deadlocking using the reproducer,

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/splice/splice01.c

[ 1749.956818] 
[ 1749.958492] ======================================================
[ 1749.965386] [ INFO: possible circular locking dependency detected ]
[ 1749.972381] 4.8.0-rc4+ #34 Not tainted
[ 1749.976560] -------------------------------------------------------
[ 1749.983554] splice01/35921 is trying to acquire lock:
[ 1749.989188]  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffffa083c1f7>] xfs_file_buffered_aio_write+0x127/0x840 [xfs]
[ 1750.001644] 
[ 1750.001644] but task is already holding lock:
[ 1750.008151]  (&pipe->mutex/1){+.+.+.}, at: [<ffffffff8169e7c1>] pipe_lock+0x51/0x60
[ 1750.016753] 
[ 1750.016753] which lock already depends on the new lock.
[ 1750.016753] 
[ 1750.025880] 
[ 1750.025880] the existing dependency chain (in reverse order) is:
[ 1750.034229] 
-> #2 (&pipe->mutex/1){+.+.+.}:
[ 1750.039139]        [<ffffffff812af52a>] lock_acquire+0x1fa/0x440
[ 1750.045857]        [<ffffffff8266448d>] mutex_lock_nested+0xdd/0x850
[ 1750.052963]        [<ffffffff8169e7c1>] pipe_lock+0x51/0x60
[ 1750.059190]        [<ffffffff8171ee25>] splice_to_pipe+0x75/0x9e0
[ 1750.066001]        [<ffffffff81723991>] __generic_file_splice_read+0xa71/0xe90
[ 1750.074071]        [<ffffffff81723e71>] generic_file_splice_read+0xc1/0x1f0
[ 1750.081849]        [<ffffffffa0838628>] xfs_file_splice_read+0x368/0x7b0 [xfs]
[ 1750.089940]        [<ffffffff8171fa7e>] do_splice_to+0xee/0x150
[ 1750.096555]        [<ffffffff817262f4>] SyS_splice+0x1144/0x1c10
[ 1750.103269]        [<ffffffff81007b66>] do_syscall_64+0x1a6/0x500
[ 1750.110084]        [<ffffffff8266ea7f>] return_from_SYSCALL_64+0x0/0x7a
[ 1750.117479] 
-> #1 (&(&ip->i_iolock)->mr_lock#2){++++++}:
[ 1750.123649]        [<ffffffff812af52a>] lock_acquire+0x1fa/0x440
[ 1750.130362]        [<ffffffff8129b93e>] down_write_nested+0x5e/0xe0
[ 1750.137371]        [<ffffffffa086772e>] xfs_ilock+0x2fe/0x550 [xfs]
[ 1750.144397]        [<ffffffffa083c204>] xfs_file_buffered_aio_write+0x134/0x840 [xfs]
[ 1750.153175]        [<ffffffffa083cb7d>] xfs_file_write_iter+0x26d/0x6d0 [xfs]
[ 1750.161177]        [<ffffffff8168374e>] __vfs_write+0x2be/0x640
[ 1750.167799]        [<ffffffff816876e2>] vfs_write+0x152/0x4b0
[ 1750.174220]        [<ffffffff8168b0df>] SyS_write+0xdf/0x1d0
[ 1750.180547]        [<ffffffff8266e9bc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 1750.188328] 
-> #0 (&sb->s_type->i_mutex_key#14){+.+.+.}:
[ 1750.194508]        [<ffffffff812adbc3>] __lock_acquire+0x3043/0x3dd0
[ 1750.201609]        [<ffffffff812af52a>] lock_acquire+0x1fa/0x440
[ 1750.208321]        [<ffffffff82668cda>] down_write+0x5a/0xe0
[ 1750.214645]        [<ffffffffa083c1f7>] xfs_file_buffered_aio_write+0x127/0x840 [xfs]
[ 1750.223421]        [<ffffffffa083cb7d>] xfs_file_write_iter+0x26d/0x6d0 [xfs]
[ 1750.231423]        [<ffffffff816859be>] vfs_iter_write+0x29e/0x550
[ 1750.238330]        [<ffffffff81722729>] iter_file_splice_write+0x529/0xb70
[ 1750.246012]        [<ffffffff817258d4>] SyS_splice+0x724/0x1c10
[ 1750.252627]        [<ffffffff81007b66>] do_syscall_64+0x1a6/0x500
[ 1750.259438]        [<ffffffff8266ea7f>] return_from_SYSCALL_64+0x0/0x7a
[ 1750.266830] 
[ 1750.266830] other info that might help us debug this:
[ 1750.266830] 
[ 1750.275764] Chain exists of:
  &sb->s_type->i_mutex_key#14 --> &(&ip->i_iolock)->mr_lock#2 --> &pipe->mutex/1

[ 1750.287213]  Possible unsafe locking scenario:
[ 1750.287213] 
[ 1750.293817]        CPU0                    CPU1
[ 1750.298871]        ----                    ----
[ 1750.303924]   lock(&pipe->mutex/1);
[ 1750.307845]                                lock(&(&ip->i_iolock)->mr_lock#2);
[ 1750.315836]                                lock(&pipe->mutex/1);
[ 1750.322567]   lock(&sb->s_type->i_mutex_key#14);
[ 1750.327748] 
[ 1750.327748]  *** DEADLOCK ***
[ 1750.327748] 
[ 1750.334355] 2 locks held by splice01/35921:
[ 1750.339019]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff8168f444>] __sb_start_write+0xb4/0xf0
[ 1750.348595]  #1:  (&pipe->mutex/1){+.+.+.}, at: [<ffffffff8169e7c1>] pipe_lock+0x51/0x60
[ 1750.357686] 
[ 1750.357686] stack backtrace:
[ 1750.362548] CPU: 50 PID: 35921 Comm: splice01 Not tainted 4.8.0-rc4+ #34
[ 1750.370026] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 1750.381382]  0000000000000000 000000003bca9477 ffff88044c4176e0 ffffffff81a3d191
[ 1750.389675]  ffffffff84292880 ffffffff842b9e30 ffff88044c417730 ffffffff812a6aa6
[ 1750.397968]  ffffffff84292880 ffff880414a28cd0 ffff88044c417850 ffff880414a28cd0
[ 1750.406261] Call Trace:
[ 1750.408992]  [<ffffffff81a3d191>] dump_stack+0x85/0xc4
[ 1750.414725]  [<ffffffff812a6aa6>] print_circular_bug+0x356/0x460
[ 1750.421428]  [<ffffffff812adbc3>] __lock_acquire+0x3043/0x3dd0
[ 1750.427942]  [<ffffffff81414fe9>] ? is_ftrace_trampoline+0x99/0xe0
[ 1750.434840]  [<ffffffff812aab80>] ? debug_check_no_locks_freed+0x2c0/0x2c0
[ 1750.442512]  [<ffffffff812a0272>] ? add_lock_to_list.isra.29.constprop.45+0x142/0x1d0
[ 1750.451249]  [<ffffffff812acd9e>] ? __lock_acquire+0x221e/0x3dd0
[ 1750.457952]  [<ffffffff812aa3ce>] ? trace_hardirqs_on_caller+0x3fe/0x580
[ 1750.465430]  [<ffffffff812af52a>] lock_acquire+0x1fa/0x440
[ 1750.471578]  [<ffffffffa083c1f7>] ? xfs_file_buffered_aio_write+0x127/0x840 [xfs]
[ 1750.479929]  [<ffffffff82668cda>] down_write+0x5a/0xe0
[ 1750.485691]  [<ffffffffa083c1f7>] ? xfs_file_buffered_aio_write+0x127/0x840 [xfs]
[ 1750.494070]  [<ffffffffa083c1f7>] xfs_file_buffered_aio_write+0x127/0x840 [xfs]
[ 1750.502226]  [<ffffffff81007b66>] ? do_syscall_64+0x1a6/0x500
[ 1750.508666]  [<ffffffffa083c0d0>] ? xfs_file_dio_aio_write+0xca0/0xca0 [xfs]
[ 1750.516532]  [<ffffffff812a9f72>] ? mark_held_locks+0xd2/0x130
[ 1750.523044]  [<ffffffff812f5887>] ? debug_lockdep_rcu_enabled+0x77/0x90
[ 1750.530417]  [<ffffffff82664873>] ? mutex_lock_nested+0x4c3/0x850
[ 1750.537243]  [<ffffffffa083cb7d>] xfs_file_write_iter+0x26d/0x6d0 [xfs]
[ 1750.544625]  [<ffffffff8169e7c1>] ? pipe_lock+0x51/0x60
[ 1750.550456]  [<ffffffff816859be>] vfs_iter_write+0x29e/0x550
[ 1750.556770]  [<ffffffff81685720>] ? vfs_iter_read+0x540/0x540
[ 1750.563181]  [<ffffffff81722729>] iter_file_splice_write+0x529/0xb70
[ 1750.570271]  [<ffffffff81722200>] ? page_cache_pipe_buf_confirm+0x1f0/0x1f0
[ 1750.578041]  [<ffffffff812f5a33>] ? rcu_read_lock_sched_held+0xa3/0x120
[ 1750.585423]  [<ffffffff812f6055>] ? rcu_sync_lockdep_assert+0x75/0xb0
[ 1750.592609]  [<ffffffff8129bd6c>] ? percpu_down_read+0x5c/0xa0
[ 1750.599118]  [<ffffffff8168f444>] ? __sb_start_write+0xb4/0xf0
[ 1750.605627]  [<ffffffff817258d4>] SyS_splice+0x724/0x1c10
[ 1750.611651]  [<ffffffff812f5a33>] ? rcu_read_lock_sched_held+0xa3/0x120
[ 1750.619033]  [<ffffffff817251b0>] ? compat_SyS_vmsplice+0x1f0/0x1f0
[ 1750.626025]  [<ffffffff81007a12>] ? do_syscall_64+0x52/0x500
[ 1750.632338]  [<ffffffff817251b0>] ? compat_SyS_vmsplice+0x1f0/0x1f0
[ 1750.639330]  [<ffffffff81007b66>] do_syscall_64+0x1a6/0x500
[ 1750.645549]  [<ffffffff8100401a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1750.652737]  [<ffffffff8266ea7f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > Rainer?
> > 
> >                  Linus
> > 
>
Al Viro Sept. 2, 2016, 4 p.m. UTC | #4
On Fri, Sep 02, 2016 at 04:18:04PM +0100, Rainer Weikusat wrote:

> As far as I can tell, this should work, as I can't currently imagine why
> a fs operation might end up binding a unix socket. That said, the idea
> of making af_unix.c yet more complicated in order to work around the
> irregular behaviour of (as far as I can tell) a single filesystem (for
> which kern_path_create doesn't really mean kern_path_create

Bullshit.  kern_path_create() *does* mean the same thing in all cases.
Namely, find the parent, lock it and leave the final name component for
the create-type operation.  It sure as hell is not guaranteed to take
*all* locks that are going to be taken in process of mknod/mkdir/etc.
Never had been.

> and which has to work
> around that once it gets control) goes against all instincts I have in
> this area. If filesystems need to do arbitrary stuff when
> __sb_start_write is called for 'their' superblock, they should be able
> to do so directly.
> 
> At present, this is a theoretical concern, as I can't (due to other work
> commitments) put any non-cursory work into this before Sunday. There
> may also be other reasons why this idea is impractical or even
> unworkable.
Rainer Weikusat Sept. 2, 2016, 4:10 p.m. UTC | #5
Al Viro <viro@ZenIV.linux.org.uk> writes:
> On Fri, Sep 02, 2016 at 04:18:04PM +0100, Rainer Weikusat wrote:
>
>> As far as I can tell, this should work, as I can't currently imagine
>> why a fs operation might end up binding a unix socket. That said, the
>> idea of making af_unix.c yet more complicated in order to work around
>> the irregular behaviour of (as far as I can tell) a single filesystem
>> (for which kern_path_create doesn't really mean kern_path_create and
>> which has to work around that once it gets control) goes against all
>> instincts I have in this area. If filesystems need to do arbitrary
>> stuff when __sb_start_write is called for 'their' superblock, they
>> should be able to do so directly.
>
> Bullshit.  kern_path_create() *does* mean the same thing in all cases.
> Namely, find the parent, lock it and leave the final name component for
> the create-type operation.  It sure as hell is not guaranteed to take
> *all* locks that are going to be taken in process of mknod/mkdir/etc.
> Never had been.

This isn't about "all locks", it's about the lock in question. No other
mknod operation (that I'm aware of) calls this with a superblock other
than the one already acted upon by kern_path_create. This may be wrong
(if so, feel free to correct it), but it's not "bullshit" (intentional
deception in order to sell something to someone).
CAI Qian Sept. 2, 2016, 4:46 p.m. UTC | #6
----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Linus Torvalds" <torvalds@linux-foundation.org>
> Sent: Friday, September 2, 2016 11:51:58 AM
> Subject: Re: possible circular locking dependency detected
> 
> > > This is still *entirely* untested.
> > Tested-by: CAI Qian <caiqian@redhat.com>
OK, this tag still stands. The below issue also reproduces without those
patches, so it is most likely a separate problem introduced recently (after
rc3 or rc4), probably by some xfs update.
   CAI Qian 
> Actually, I take that back: splice now seems to start deadlocking using
> the reproducer,
> 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/splice/splice01.c
Al Viro Sept. 2, 2016, 5:02 p.m. UTC | #7
On Fri, Sep 02, 2016 at 05:10:13PM +0100, Rainer Weikusat wrote:

> > Bullshit.  kern_path_create() *does* mean the same thing in all cases.
> > Namely, find the parent, lock it and leave the final name component for
> > the create-type operation.  It sure as hell is not guaranteed to take
> > *all* locks that are going to be taken in process of mknod/mkdir/etc.
> > Never had been.
> 
> This isn't about "all locks", it's about the lock in question. No other
> mknod operation (that I'm aware of) calls this with a superblock other
> than the one already acted upon by kern_path_create. This may be wrong
> (if so, feel free to correct it), but it's not "bullshit" (intentional
> deception in order to sell something to someone).
> 

Never had been promised.  And it's not just this lock - e.g. ->i_rwsem is
taken on the parent by kern_path_create() and on the parent in the underlying
filesystem by the ecryptfs ->mknod() (as well as the overlayfs one).  Consider
a bind/bind deadlock - one bind for a path on ecryptfs, another for a path on
the raw filesystem behind it (which can be mounted elsewhere/in another
namespace/etc.), with those paths ending in the matching directories (the last
components may be the same or different - it doesn't matter):

A: lock parent in ecryptfs (via kern_path_create())
B: lock the directory behind it in underlying fs (ditto)
A: grab ->readlock
B: block on ->readlock
A: call ecryptfs_mknod() and block trying to lock the directory held by B

Deadlock.  And while we are at it, ecryptfs probably ought to claim transient
write access for the duration of modifications of the underlying fs, similarly
to overlayfs.  The point is, it had never been promised that you can stick
random locks just outside of ->mknod()/->mkdir()/etc.  The same goes for e.g.
an NFS mount of something exported by localhost; knfsd must lock the parent
directory on the server before creating anything in it.  Suppose you have
/srv/nfs/foo exported and mounted on the same host at /mnt/bar.  bind to
/mnt/bar/barf/a vs. bind to /srv/nfs/foo/barf/b:
A: lock /mnt/bar/barf
B: lock /srv/nfs/foo/barf
A: grab ->readlock
B: block on ->readlock
A: call nfs_mknod(), wait for reply
knfsd: block trying to lock /srv/nfs/foo/barf

It's very much _not_ just overlayfs being pathological - that it certainly is,
but the problem is much wider.  You could try to argue that kern_path_create()
should've known to lock all relevant directories in case of overlayfs and
ecryptfs, but it has absolutely no chance to do that in case of NFS - the
protocol doesn't allow "lock this directory, one of my next requests will
be to create something in it".  Even leases are only for regular files...
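
The ecryptfs interleaving above can be modelled in userspace, with pthread
mutexes standing in for the three kernel locks (illustrative names and
sleep()-forced scheduling; this is a sketch, not the kernel code, and the
program hangs by design):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t upper_dir = PTHREAD_MUTEX_INITIALIZER; /* ecryptfs parent   */
static pthread_mutex_t lower_dir = PTHREAD_MUTEX_INITIALIZER; /* underlying parent */
static pthread_mutex_t readlock  = PTHREAD_MUTEX_INITIALIZER; /* u->readlock       */

static void *bind_A(void *arg)			/* bind via the ecryptfs path */
{
	(void)arg;
	pthread_mutex_lock(&upper_dir);		/* kern_path_create()          */
	sleep(1);				/* B locks lower_dir meanwhile */
	pthread_mutex_lock(&readlock);		/* u->readlock                 */
	sleep(1);				/* B blocks on readlock        */
	pthread_mutex_lock(&lower_dir);		/* ecryptfs_mknod(): deadlock  */
	return NULL;
}

static void *bind_B(void *arg)			/* bind via the underlying fs path */
{
	(void)arg;
	pthread_mutex_lock(&lower_dir);		/* kern_path_create()          */
	sleep(2);				/* wait until A holds readlock */
	pthread_mutex_lock(&readlock);		/* blocks: A holds it          */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, bind_A, NULL);
	pthread_create(&b, NULL, bind_B, NULL);
	pthread_join(a, NULL);			/* never returns */
	pthread_join(b, NULL);
	return 0;
}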
Linus Torvalds Sept. 2, 2016, 5:10 p.m. UTC | #8
On Fri, Sep 2, 2016 at 8:51 AM, CAI Qian <caiqian@redhat.com> wrote:
>
> Actually, I take that back: splice now seems to start deadlocking using the reproducer,
>
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/splice/splice01.c

This is a different deadlock, though. This is a deadlock due to mixed
lock ordering between the pipe mutex, the XFS "mr_lock", and the inode
mutex.

If I read the lockdep trace correctly, we have:

Normally we have write doing inode->i_mutex -> i_iolock.mr_lock for
the regular write path.

But the normal splice "write()" case has

  pipe->mutex -> filesystem write lock (normally i_mutex)

(in iter_file_splice_write() that calls vfs_iter_write() that calls
->write_iter())

and then the XFS splice read case has

    mr_lock -> pipe->mutex

in xfs_file_splice_read() calling splice_to_pipe().

So you end up with a A->B->C->A chain.
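
As an illustration (not the actual VFS/XFS code), the three orderings can
be modelled in userspace with pthread mutexes; run repeatedly, the three
threads can lock up on exactly this A->B->C->A cycle:

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t i_mutex    = PTHREAD_MUTEX_INITIALIZER; /* sb->s_type->i_mutex_key */
static pthread_mutex_t mr_lock    = PTHREAD_MUTEX_INITIALIZER; /* ip->i_iolock->mr_lock   */
static pthread_mutex_t pipe_mutex = PTHREAD_MUTEX_INITIALIZER; /* pipe->mutex             */

/* write(2): i_mutex -> mr_lock (dependency #1 in the trace) */
static void *regular_write(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&i_mutex);
	pthread_mutex_lock(&mr_lock);
	pthread_mutex_unlock(&mr_lock);
	pthread_mutex_unlock(&i_mutex);
	return NULL;
}

/* xfs_file_splice_read() -> splice_to_pipe(): mr_lock -> pipe_mutex (#2) */
static void *splice_read(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mr_lock);
	pthread_mutex_lock(&pipe_mutex);
	pthread_mutex_unlock(&pipe_mutex);
	pthread_mutex_unlock(&mr_lock);
	return NULL;
}

/* iter_file_splice_write() -> ->write_iter(): pipe_mutex -> i_mutex (#0) */
static void *splice_write(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&pipe_mutex);
	pthread_mutex_lock(&i_mutex);
	pthread_mutex_unlock(&i_mutex);
	pthread_mutex_unlock(&pipe_mutex);
	return NULL;
}

int main(void)
{
	pthread_t t[3];

	pthread_create(&t[0], NULL, regular_write, NULL);
	pthread_create(&t[1], NULL, splice_read, NULL);
	pthread_create(&t[2], NULL, splice_write, NULL);
	for (int i = 0; i < 3; i++)
		pthread_join(t[i], NULL);
	return 0;
}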

I think this is new with the iomap-based buffered write path in 4.8.

Dave, Christoph?

                 Linus

Linus Torvalds Sept. 2, 2016, 5:12 p.m. UTC | #9
On Fri, Sep 2, 2016 at 10:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> It's very much _not_ just overlayfs being pathological - that it certainly is,
> but the problem is much wider.

Al, can you take a look at my two patches, and see if you agree that
they fix it, though?

Of course, we now have *another* splice deadlock. That pipe inode is
nasty, it's very easy to deadlock on it in subtle ways.

             Linus
Rainer Weikusat Sept. 2, 2016, 5:40 p.m. UTC | #10
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Fri, Sep 2, 2016 at 10:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> It's very much _not_ just overlayfs being pathological - that it certainly is,
>> but the problem is much wider.
>
> Al, can you take a look at my two patches, and see if you agree that
> they fix it, though?

The original deadlock occurred because some code path locked the
superblock and then tried to acquire the af_unix readlock while
unix_bind did the same in the opposite order (by doing kern_path_create
with the readlock held). If unix_bind no longer shares a lock with the
receive routines, this obviously can't happen anymore.

The other problem situation is one where a lock on something can be
acquired both by kern_path_create and by a mknod operation, with the
readlock taken in between. Because the readlock sits between the
kern_path_create and the mknod, it can block a thread which got a
certain lock via kern_path_create because another thread which is about
to acquire that lock via mknod got the readlock first. This obviously
can't happen anymore once the original 'acquire readlock (now bindlock)
first' ordering is restored.
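
In lock-ordering terms, the restored scheme might be sketched like this (a
userspace model with stub locks and illustrative names, not the actual
net/unix/af_unix.c code):

#include <pthread.h>

static pthread_mutex_t bindlock   = PTHREAD_MUTEX_INITIALIZER; /* u->bindlock */
static pthread_mutex_t iolock     = PTHREAD_MUTEX_INITIALIZER; /* u->iolock   */
static pthread_mutex_t parent_dir = PTHREAD_MUTEX_INITIALIZER; /* dir i_mutex */

static void unix_bind_model(void)
{
	pthread_mutex_lock(&bindlock);		/* outermost: serializes bind only */
	pthread_mutex_lock(&parent_dir);	/* kern_path_create()              */
	/* vfs_mknod() + assignment of the socket's address go here */
	pthread_mutex_unlock(&parent_dir);	/* done_path_create()              */
	pthread_mutex_unlock(&bindlock);
}

static void unix_recv_model(void)
{
	pthread_mutex_lock(&iolock);		/* IO path: no VFS locks held */
	/* receive-queue manipulation */
	pthread_mutex_unlock(&iolock);
}

int main(void)
{
	unix_bind_model();
	unix_recv_model();
	return 0;
}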
Al Viro Sept. 2, 2016, 5:52 p.m. UTC | #11
On Fri, Sep 02, 2016 at 10:12:08AM -0700, Linus Torvalds wrote:
> On Fri, Sep 2, 2016 at 10:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > It's very much _not_ just overlayfs being pathological - that it certainly is,
> > but the problem is much wider.
> 
> Al, can you take a look at my two patches, and see if you agree that
> they fix it, though?

AFAICS, they should.  Locking is obviously saner that way and AFAICS the
rest is absolutely straightforward.

Acked-by: Al Viro <viro@zeniv.linux.org.uk>

> Of course, we now have *another* splice deadlock. That pipe inode is
> nasty, it's very easy to deadlock on it in subtle ways.

I'm still digging through iomap.c, but that's better taken to another branch
of this thread...
Al Viro Sept. 2, 2016, 5:53 p.m. UTC | #12
On Fri, Sep 02, 2016 at 06:40:59PM +0100, Rainer Weikusat wrote:

> The original deadlock occurred because some code path locked the
> superblock and then tried to acquire the af_unix readlock while

Not even that - one code path takes ->readlock under pipe lock, while
another takes pipe lock under sb_start_write...

Patch

From 9a76489d81f6d2b1da22906363d28c398d4f7c5c Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 1 Sep 2016 14:43:53 -0700
Subject: [PATCH 2/2] af_unix: split 'u->readlock' into two: 'iolock' and
 'bindlock'

Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/net/af_unix.h |  2 +-
 net/unix/af_unix.c    | 45 +++++++++++++++++++++++----------------------
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9b4c418bebd8..fd60eccb59a6 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -52,7 +52,7 @@  struct unix_sock {
 	struct sock		sk;
 	struct unix_address     *addr;
 	struct path		path;
-	struct mutex		readlock;
+	struct mutex		iolock, bindlock;
 	struct sock		*peer;
 	struct list_head	link;
 	atomic_long_t		inflight;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 433ae1bbef97..8309687a56b0 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -661,11 +661,11 @@  static int unix_set_peek_off(struct sock *sk, int val)
 {
 	struct unix_sock *u = unix_sk(sk);
 
-	if (mutex_lock_interruptible(&u->readlock))
+	if (mutex_lock_interruptible(&u->iolock))
 		return -EINTR;
 
 	sk->sk_peek_off = val;
-	mutex_unlock(&u->readlock);
+	mutex_unlock(&u->iolock);
 
 	return 0;
 }
@@ -779,7 +779,8 @@  static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	spin_lock_init(&u->lock);
 	atomic_long_set(&u->inflight, 0);
 	INIT_LIST_HEAD(&u->link);
-	mutex_init(&u->readlock); /* single task reading lock */
+	mutex_init(&u->iolock); /* single task reading lock */
+	mutex_init(&u->bindlock); /* single task binding lock */
 	init_waitqueue_head(&u->peer_wait);
 	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
@@ -848,7 +849,7 @@  static int unix_autobind(struct socket *sock)
 	int err;
 	unsigned int retries = 0;
 
-	err = mutex_lock_interruptible(&u->readlock);
+	err = mutex_lock_interruptible(&u->bindlock);
 	if (err)
 		return err;
 
@@ -895,7 +896,7 @@  retry:
 	spin_unlock(&unix_table_lock);
 	err = 0;
 
-out:	mutex_unlock(&u->readlock);
+out:	mutex_unlock(&u->bindlock);
 	return err;
 }
 
@@ -1009,7 +1010,7 @@  static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		goto out;
 	addr_len = err;
 
-	err = mutex_lock_interruptible(&u->readlock);
+	err = mutex_lock_interruptible(&u->bindlock);
 	if (err)
 		goto out;
 
@@ -1063,7 +1064,7 @@  static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 out_unlock:
 	spin_unlock(&unix_table_lock);
 out_up:
-	mutex_unlock(&u->readlock);
+	mutex_unlock(&u->bindlock);
 out:
 	return err;
 }
@@ -1955,17 +1956,17 @@  static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
 	if (false) {
 alloc_skb:
 		unix_state_unlock(other);
-		mutex_unlock(&unix_sk(other)->readlock);
+		mutex_unlock(&unix_sk(other)->iolock);
 		newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
 					      &err, 0);
 		if (!newskb)
 			goto err;
 	}
 
-	/* we must acquire readlock as we modify already present
+	/* we must acquire iolock as we modify already present
 	 * skbs in the sk_receive_queue and mess with skb->len
 	 */
-	err = mutex_lock_interruptible(&unix_sk(other)->readlock);
+	err = mutex_lock_interruptible(&unix_sk(other)->iolock);
 	if (err) {
 		err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
 		goto err;
@@ -2032,7 +2033,7 @@  alloc_skb:
 	}
 
 	unix_state_unlock(other);
-	mutex_unlock(&unix_sk(other)->readlock);
+	mutex_unlock(&unix_sk(other)->iolock);
 
 	other->sk_data_ready(other);
 	scm_destroy(&scm);
@@ -2041,7 +2042,7 @@  alloc_skb:
 err_state_unlock:
 	unix_state_unlock(other);
 err_unlock:
-	mutex_unlock(&unix_sk(other)->readlock);
+	mutex_unlock(&unix_sk(other)->iolock);
 err:
 	kfree_skb(newskb);
 	if (send_sigpipe && !(flags & MSG_NOSIGNAL))
@@ -2109,7 +2110,7 @@  static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
 	do {
-		mutex_lock(&u->readlock);
+		mutex_lock(&u->iolock);
 
 		skip = sk_peek_offset(sk, flags);
 		skb = __skb_try_recv_datagram(sk, flags, &peeked, &skip, &err,
@@ -2117,14 +2118,14 @@  static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		if (skb)
 			break;
 
-		mutex_unlock(&u->readlock);
+		mutex_unlock(&u->iolock);
 
 		if (err != -EAGAIN)
 			break;
 	} while (timeo &&
 		 !__skb_wait_for_more_packets(sk, &err, &timeo, last));
 
-	if (!skb) { /* implies readlock unlocked */
+	if (!skb) { /* implies iolock unlocked */
 		unix_state_lock(sk);
 		/* Signal EOF on disconnected non-blocking SEQPACKET socket. */
 		if (sk->sk_type == SOCK_SEQPACKET && err == -EAGAIN &&
@@ -2189,7 +2190,7 @@  static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 
 out_free:
 	skb_free_datagram(sk, skb);
-	mutex_unlock(&u->readlock);
+	mutex_unlock(&u->iolock);
 out:
 	return err;
 }
@@ -2284,7 +2285,7 @@  static int unix_stream_read_generic(struct unix_stream_read_state *state)
 	/* Lock the socket to prevent queue disordering
 	 * while sleeps in memcpy_tomsg
 	 */
-	mutex_lock(&u->readlock);
+	mutex_lock(&u->iolock);
 
 	if (flags & MSG_PEEK)
 		skip = sk_peek_offset(sk, flags);
@@ -2326,7 +2327,7 @@  again:
 				break;
 			}
 
-			mutex_unlock(&u->readlock);
+			mutex_unlock(&u->iolock);
 
 			timeo = unix_stream_data_wait(sk, timeo, last,
 						      last_len);
@@ -2337,7 +2338,7 @@  again:
 				goto out;
 			}
 
-			mutex_lock(&u->readlock);
+			mutex_lock(&u->iolock);
 			goto redo;
 unlock:
 			unix_state_unlock(sk);
@@ -2440,7 +2441,7 @@  unlock:
 		}
 	} while (size);
 
-	mutex_unlock(&u->readlock);
+	mutex_unlock(&u->iolock);
 	if (state->msg)
 		scm_recv(sock, state->msg, &scm, flags);
 	else
@@ -2481,9 +2482,9 @@  static ssize_t skb_unix_socket_splice(struct sock *sk,
 	int ret;
 	struct unix_sock *u = unix_sk(sk);
 
-	mutex_unlock(&u->readlock);
+	mutex_unlock(&u->iolock);
 	ret = splice_to_pipe(pipe, spd);
-	mutex_lock(&u->readlock);
+	mutex_lock(&u->iolock);
 
 	return ret;
 }
-- 
2.10.0.rc0.2.g0a9fa47