diff mbox series

powerpc/qspinlock: Fix deadlock in MCS queue

Message ID 20240826081251.744325-1-nysal@linux.ibm.com (mailing list archive)
State Superseded, archived
Headers show
Series powerpc/qspinlock: Fix deadlock in MCS queue | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 5 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 21 jobs.

Commit Message

Nysal Jan K.A. Aug. 26, 2024, 8:12 a.m. UTC
If an interrupt occurs in queued_spin_lock_slowpath() after we increment
qnodesp->count and before node->lock is initialized, another CPU might
see stale lock values in get_tail_qnode(). If the stale lock value happens
to match the lock on that CPU, then we write to the "next" pointer of
the wrong qnode. This causes a deadlock as the former CPU, once it becomes
the head of the MCS queue, will spin indefinitely until it's "next" pointer
is set by its successor in the queue. This results in lockups similar to
the following.

   watchdog: CPU 15 Hard LOCKUP
   ......
   NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
   LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
   Call Trace:
    0xc000002cfffa3bf0 (unreliable)
    _raw_spin_lock+0x6c/0x90
    raw_spin_rq_lock_nested.part.135+0x4c/0xd0
    sched_ttwu_pending+0x60/0x1f0
    __flush_smp_call_function_queue+0x1dc/0x670
    smp_ipi_demux_relaxed+0xa4/0x100
    xive_muxed_ipi_action+0x20/0x40
    __handle_irq_event_percpu+0x80/0x240
    handle_irq_event_percpu+0x2c/0x80
    handle_percpu_irq+0x84/0xd0
    generic_handle_irq+0x54/0x80
    __do_irq+0xac/0x210
    __do_IRQ+0x74/0xd0
    0x0
    do_IRQ+0x8c/0x170
    hardware_interrupt_common_virt+0x29c/0x2a0
   --- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
   ......
   NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
   LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
   --- interrupt: 500
    0xc0000029c1a41d00 (unreliable)
    _raw_spin_lock+0x6c/0x90
    futex_wake+0x100/0x260
    do_futex+0x21c/0x2a0
    sys_futex+0x98/0x270
    system_call_exception+0x14c/0x2f0
    system_call_vectored_common+0x15c/0x2ec

The following code flow illustrates how the deadlock occurs:

        CPU0                                   CPU1
        ----                                   ----
  spin_lock_irqsave(A)                          |
  spin_unlock_irqrestore(A)                     |
    spin_lock(B)                                |
         |                                      |
         ▼                                      |
   id = qnodesp->count++;                       |
  (Note that nodes[0].lock == A)                |
         |                                      |
         ▼                                      |
      Interrupt                                 |
  (happens before "nodes[0].lock = B")          |
         |                                      |
         ▼                                      |
  spin_lock_irqsave(A)                          |
         |                                      |
         ▼                                      |
   id = qnodesp->count++                        |
   nodes[1].lock = A                            |
         |                                      |
         ▼                                      |
  Tail of MCS queue                             |
         |                             spin_lock_irqsave(A)
         ▼                                      |
  Head of MCS queue                             ▼
         |                             CPU0 is previous tail
         ▼                                      |
   Spin indefinitely                            ▼
  (until "nodes[1].next != NULL")      prev = get_tail_qnode(A, CPU0)
                                                |
                                                ▼
                                       prev == &qnodes[CPU0].nodes[0]
                                     (as qnodes[CPU0].nodes[0].lock == A)
                                                |
                                                ▼
                                       WRITE_ONCE(prev->next, node)
                                                |
                                                ▼
                                        Spin indefinitely
                                     (until nodes[0].locked == 1)

Thanks to Saket Kumar Bhaskar for help with recreating the issue

Fixes: 84990b169557 ("powerpc/qspinlock: add mcs queueing for contended waiters")
Cc: stable@vger.kernel.org # v6.2+
Reported-by: Geetika Moolchandani <geetika@linux.ibm.com>
Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
Reported-by: Jijo Varghese <vargjijo@in.ibm.com>
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
---
 arch/powerpc/lib/qspinlock.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Nicholas Piggin Aug. 28, 2024, 3:19 a.m. UTC | #1
Hey Nysal,

This is really good debugging, and a nice write up.

On Mon Aug 26, 2024 at 6:12 PM AEST, Nysal Jan K.A. wrote:
> If an interrupt occurs in queued_spin_lock_slowpath() after we increment
> qnodesp->count and before node->lock is initialized, another CPU might
> see stale lock values in get_tail_qnode(). If the stale lock value happens
> to match the lock on that CPU, then we write to the "next" pointer of
> the wrong qnode. This causes a deadlock as the former CPU, once it becomes
> the head of the MCS queue, will spin indefinitely until it's "next" pointer
> is set by its successor in the queue. This results in lockups similar to
> the following.
>
>    watchdog: CPU 15 Hard LOCKUP
>    ......
>    NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
>    LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
>    Call Trace:
>     0xc000002cfffa3bf0 (unreliable)
>     _raw_spin_lock+0x6c/0x90
>     raw_spin_rq_lock_nested.part.135+0x4c/0xd0
>     sched_ttwu_pending+0x60/0x1f0
>     __flush_smp_call_function_queue+0x1dc/0x670
>     smp_ipi_demux_relaxed+0xa4/0x100
>     xive_muxed_ipi_action+0x20/0x40
>     __handle_irq_event_percpu+0x80/0x240
>     handle_irq_event_percpu+0x2c/0x80
>     handle_percpu_irq+0x84/0xd0
>     generic_handle_irq+0x54/0x80
>     __do_irq+0xac/0x210
>     __do_IRQ+0x74/0xd0
>     0x0
>     do_IRQ+0x8c/0x170
>     hardware_interrupt_common_virt+0x29c/0x2a0
>    --- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
>    ......
>    NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
>    LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
>    --- interrupt: 500
>     0xc0000029c1a41d00 (unreliable)
>     _raw_spin_lock+0x6c/0x90
>     futex_wake+0x100/0x260
>     do_futex+0x21c/0x2a0
>     sys_futex+0x98/0x270
>     system_call_exception+0x14c/0x2f0
>     system_call_vectored_common+0x15c/0x2ec
>
> The following code flow illustrates how the deadlock occurs:
>
>         CPU0                                   CPU1
>         ----                                   ----
>   spin_lock_irqsave(A)                          |
>   spin_unlock_irqrestore(A)                     |
>     spin_lock(B)                                |
>          |                                      |
>          ▼                                      |
>    id = qnodesp->count++;                       |
>   (Note that nodes[0].lock == A)                |
>          |                                      |
>          ▼                                      |
>       Interrupt                                 |
>   (happens before "nodes[0].lock = B")          |
>          |                                      |
>          ▼                                      |
>   spin_lock_irqsave(A)                          |
>          |                                      |
>          ▼                                      |
>    id = qnodesp->count++                        |
>    nodes[1].lock = A                            |
>          |                                      |
>          ▼                                      |
>   Tail of MCS queue                             |
>          |                             spin_lock_irqsave(A)
>          ▼                                      |
>   Head of MCS queue                             ▼
>          |                             CPU0 is previous tail
>          ▼                                      |
>    Spin indefinitely                            ▼
>   (until "nodes[1].next != NULL")      prev = get_tail_qnode(A, CPU0)
>                                                 |
>                                                 ▼
>                                        prev == &qnodes[CPU0].nodes[0]
>                                      (as qnodes[CPU0].nodes[0].lock == A)
>                                                 |
>                                                 ▼
>                                        WRITE_ONCE(prev->next, node)
>                                                 |
>                                                 ▼
>                                         Spin indefinitely
>                                      (until nodes[0].locked == 1)

I can follow your scenario, and agree it is a bug.

Generic qspinlock code does not have a similar path because it encodes
idx with the CPU in the spinlock word. The powerpc qspinlocks removed
that to save some bits in the word (to support more CPUs).

What probably makes it really difficult to hit is that I think both
locks A and B need contention from other sources to push them into
queueing slow path. I guess that's omitted for brevity in the flow
above, which is fine.

> Thanks to Saket Kumar Bhaskar for help with recreating the issue
>
> Fixes: 84990b169557 ("powerpc/qspinlock: add mcs queueing for contended waiters")
> Cc: stable@vger.kernel.org # v6.2+
> Reported-by: Geetika Moolchandani <geetika@linux.ibm.com>
> Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
> Reported-by: Jijo Varghese <vargjijo@in.ibm.com>
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
> ---
>  arch/powerpc/lib/qspinlock.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5de4dd549f6e..59861c665cef 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -697,6 +697,12 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	}
>  
>  release:
> +	/*
> +	 * Clear the lock, as another CPU might see stale values if an
> +	 * interrupt occurs after we increment qnodesp->count but before
> +	 * node->lock is initialized
> +	 */
> +	node->lock = NULL;
>  	qnodesp->count--; /* release the node */

AFAIKS this fix works.

There is one complication which is those two stores could be swapped by
the compiler. So we could take an IRQ here that sees the node has been
freed, but node->lock has not yet been cleared. Basically equivalent to
the problem solved by the barrier() on the count++ side.

This reordering would not cause a problem in your scenario AFAIKS
because when the lock call returns, node->lock *will* be cleared so it
can not cause a problem later.

Still, should we put a barrier() between these just to make things a
bit cleaner? I.e., when count is decremented, we definitely won't do
any other stores to node. Otherwise,

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick
Michael Ellerman Aug. 28, 2024, 3:52 a.m. UTC | #2
"Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> If an interrupt occurs in queued_spin_lock_slowpath() after we increment
> qnodesp->count and before node->lock is initialized, another CPU might
> see stale lock values in get_tail_qnode(). If the stale lock value happens
> to match the lock on that CPU, then we write to the "next" pointer of
> the wrong qnode. This causes a deadlock as the former CPU, once it becomes
> the head of the MCS queue, will spin indefinitely until it's "next" pointer
> is set by its successor in the queue. This results in lockups similar to
> the following.
...
>
> Thanks to Saket Kumar Bhaskar for help with recreating the issue
>
> Fixes: 84990b169557 ("powerpc/qspinlock: add mcs queueing for contended waiters")
> Cc: stable@vger.kernel.org # v6.2+
> Reported-by: Geetika Moolchandani <geetika@linux.ibm.com>
> Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
> Reported-by: Jijo Varghese <vargjijo@in.ibm.com>
 
Do we have links for any of these reports?

cheers
Nysal Jan K.A. Aug. 28, 2024, 4:27 a.m. UTC | #3
On Wed, Aug 28, 2024 at 01:19:46PM GMT, Nicholas Piggin wrote:
<snip>
> What probably makes it really difficult to hit is that I think both
> locks A and B need contention from other sources to push them into
> queueing slow path. I guess that's omitted for brevity in the flow
> above, which is fine.
> 

I'll mention that in the commit message, just so that it is clear.

> 
> AFAIKS this fix works.
> 
> There is one complication which is those two stores could be swapped by
> the compiler. So we could take an IRQ here that sees the node has been
> freed, but node->lock has not yet been cleared. Basically equivalent to
> the problem solved by the barrier() on the count++ side.
> 
> This reordering would not cause a problem in your scenario AFAIKS
> because when the lock call returns, node->lock *will* be cleared so it
> can not cause a problem later.
> 
> Still, should we put a barrier() between these just to make things a
> bit cleaner? I.e., when count is decremented, we definitely won't do
> any other stores to node. Otherwise,
> 

Agree, that will make it cleaner.

> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> 
> Thanks,
> Nick

Thanks for the review Nick, I'll send a v2 with these changes.

Regards
--Nysal
Nysal Jan K.A. Aug. 28, 2024, 4:33 a.m. UTC | #4
On Wed, Aug 28, 2024 at 01:52:33PM GMT, Michael Ellerman wrote:
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > If an interrupt occurs in queued_spin_lock_slowpath() after we increment
> > qnodesp->count and before node->lock is initialized, another CPU might
> > see stale lock values in get_tail_qnode(). If the stale lock value happens
> > to match the lock on that CPU, then we write to the "next" pointer of
> > the wrong qnode. This causes a deadlock as the former CPU, once it becomes
> > the head of the MCS queue, will spin indefinitely until it's "next" pointer
> > is set by its successor in the queue. This results in lockups similar to
> > the following.
> ...
> >
> > Thanks to Saket Kumar Bhaskar for help with recreating the issue
> >
> > Fixes: 84990b169557 ("powerpc/qspinlock: add mcs queueing for contended waiters")
> > Cc: stable@vger.kernel.org # v6.2+
> > Reported-by: Geetika Moolchandani <geetika@linux.ibm.com>
> > Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
> > Reported-by: Jijo Varghese <vargjijo@in.ibm.com>
>  
> Do we have links for any of these reports?

They are all internal reports on LTC bugzilla. I don't see one that is public.

> 
> cheers

Regards
--Nysal
diff mbox series

Patch

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 5de4dd549f6e..59861c665cef 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -697,6 +697,12 @@  static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	}
 
 release:
+	/*
+	 * Clear the lock, as another CPU might see stale values if an
+	 * interrupt occurs after we increment qnodesp->count but before
+	 * node->lock is initialized
+	 */
+	node->lock = NULL;
 	qnodesp->count--; /* release the node */
 }