
powerpc: Fix data corruption on IPI

Message ID 1654757454.47202735.1699948827325.JavaMail.zimbra@raptorengineeringinc.com (mailing list archive)
State Superseded, archived
Headers show
Series powerpc: Fix data corruption on IPI

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 23 jobs.

Commit Message

Timothy Pearson Nov. 14, 2023, 8 a.m. UTC
From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001
From: Timothy Pearson <tpearson@raptorengineering.com>
Date: Mon, 13 Nov 2023 22:42:58 -0600
Subject: [PATCH] powerpc: Fix data corruption on IPI

On multithreaded SMP workloads such as those using io_uring, it is possible for
multiple threads to hold an inconsistent view of system memory when an IPI is
issued.  This in turn leads to userspace memory corruption with varying degrees
of probability based on workload and inter-thread timing.

io_uring provokes this bug by its use of TWA_SIGNAL during thread creation,
which is especially noticeable as significant userspace data corruption with
certain workloads such as MariaDB (bug MDEV-30728).  While using
TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires
this workaround.

Issue an lwsync barrier instruction prior to sending the IPI.  This ensures
the receiving CPU has a consistent view of system memory, in line with other
architectures.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual
POWER9 DD2.2 CPUs.

Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
---
 arch/powerpc/kernel/smp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Nov. 14, 2023, 12:14 p.m. UTC | #1
Hi Timothy,

Thanks for debugging this, but I'm unclear why this is helping because
we should already have a full barrier (hwsync) on both the sending and
receiving side.

More below.

Timothy Pearson <tpearson@raptorengineering.com> writes:
> From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001
> From: Timothy Pearson <tpearson@raptorengineering.com>
> Date: Mon, 13 Nov 2023 22:42:58 -0600
> Subject: [PATCH] powerpc: Fix data corruption on IPI
>
> On multithreaded SMP workloads such as those using io_uring, it is possible for
> multiple threads to hold an inconsistent view of system memory when an IPI is
> issued.  This in turn leads to userspace memory corruption with varying degrees
> of probability based on workload and inter-thread timing.
>
> io_uring provokes this bug by its use of TWA_SIGNAL during thread creation,
> which is especially noticeable as significant userspace data corruption with
> certain workloads such as MariaDB (bug MDEV-30728).  While using
> TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires
> this workaround.
>
> Issue an lwsync barrier instruction prior to sending the IPI.  This ensures
> the receiving CPU has a consistent view of system memory, in line with other
> architectures.
>
> Tested under QEMU in kvm mode, running on a Talos II workstation with dual
> POWER9 DD2.2 CPUs.
>
> Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
> ---
>  arch/powerpc/kernel/smp.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index ab691c89d787..ba42238de518 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg)
>  
>  void arch_smp_send_reschedule(int cpu)
>  {
> -	if (likely(smp_ops))
> +	if (likely(smp_ops)) {
> +		__smp_lwsync();
>  		do_message_pass(cpu, PPC_MSG_RESCHEDULE);
> +	}
>  }


Expanding the code:

static inline void do_message_pass(int cpu, int msg)
{
	if (smp_ops->message_pass)
		smp_ops->message_pass(cpu, msg);
#ifdef CONFIG_PPC_SMP_MUXED_IPI
	else
		smp_muxed_ipi_message_pass(cpu, msg);
#endif
}

The kernel should be using smp_muxed_ipi_message_pass() on your machine, so:

void smp_muxed_ipi_message_pass(int cpu, int msg)
{
	smp_muxed_ipi_set_message(cpu, msg);

	/*
	 * cause_ipi functions are required to include a full barrier
	 * before doing whatever causes the IPI.
	 */
	smp_ops->cause_ipi(cpu);
}

void smp_muxed_ipi_set_message(int cpu, int msg)
{
	struct cpu_messages *info = &per_cpu(ipi_message, cpu);
	char *message = (char *)&info->messages;

	/*
	 * Order previous accesses before accesses in the IPI handler.
	 */
	smp_mb();
	WRITE_ONCE(message[msg], 1);
}

Where that smp_mb() expands to hwsync (aka sync).
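
For reference, those barriers expand roughly as follows on ppc64 (this is a
paraphrase of arch/powerpc/include/asm/barrier.h with the generic-header
layering elided, so treat it as approximate rather than the literal source):

/* Approximate expansions on ppc64 */
#define mb()            __asm__ __volatile__ ("sync"   : : : "memory")  /* hwsync, full barrier */
#define smp_mb()        mb()
#define __smp_lwsync()  __asm__ __volatile__ ("lwsync" : : : "memory")  /* no store->load ordering */

So the lwsync added by the patch is strictly weaker than the hwsync already
issued in smp_muxed_ipi_set_message().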

The receiving side is:

irqreturn_t smp_ipi_demux(void)
{
	mb();	/* order any irq clear */

	return smp_ipi_demux_relaxed();
}

irqreturn_t smp_ipi_demux_relaxed(void)
{
	struct cpu_messages *info;
	unsigned long all;

	info = this_cpu_ptr(&ipi_message);
	do {
		all = xchg(&info->messages, 0);
                ...
		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
			scheduler_ipi();


The hwsync is not in *exactly* the same place as the lwsync you added.
But the only accesses between the lwsync and the hwsync should be the
load of ipi_messages, and I don't see why that should matter.

There must be something going wrong in this path though for your patch
to fix the issue for you, but I'm not sure how the barrier is fixing it.

You said you tested in a VM, I guess it's not easy for you to test on
the host?

I see now there's a long thread on this which I haven't read so I'll go
and read that. (Can you cc linuxppc-dev next time please? :D)

One thing you could try is just adding xnops in that location and see if
some number of them also makes the bug go away, eg:

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5826f5108a12..6c84bf55557e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg)

 void arch_smp_send_reschedule(int cpu)
 {
-       if (likely(smp_ops))
+       if (likely(smp_ops)) {
+               asm volatile ("xnop");
                do_message_pass(cpu, PPC_MSG_RESCHEDULE);
+       }
 }
 EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);


cheers
Timothy Pearson Nov. 14, 2023, 9:32 p.m. UTC | #2
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Cc: "Jens Axboe" <axboe@kernel.dk>
> Sent: Tuesday, November 14, 2023 6:14:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> Hi Timothy,
> 
> Thanks for debugging this, but I'm unclear why this is helping because
> we should already have a full barrier (hwsync) on both the sending and
> receiving side.
> 
> More below.

So it looks like we might be dealing with a couple of potentially separate issues here, not sure.  This is probably the most infuriating bug I've run across in my career, so please bear with me -- with the number of active test kernels I have installed at any one time I'm occasionally infecting one with the tests from another. ;)

First, I'm not 100% convinced the code in try_to_wake_up() is race-free on ppc64, since adding a memory barrier between it and the call to kick_process() significantly reduces the frequency of the on-disk corruption.  Can you take a look and see if anything stands out as to why that would be important?

The second part of this, though, is that the barrier only reduces the frequency of the corruption; it does not eliminate it.  After some consolidation, what completely eliminates the corruption is a combination of:
 * Switching to TWA_SIGNAL_NO_IPI in task_work_add() [1] (sketched below)
 * Adding udelay(1000) in io_uring/rw.c:io_write(), right before the call to io_rw_init_file()

Adding a memory barrier instead of the udelay() doesn't work, nor does adding the delay without switching to NO_IPI.

[1] Keeping TWA_SIGNAL and adding the barrier instruction also works, but the switch to NO_IPI is conceptually simpler in terms of understanding why it would have an effect at all.
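
For concreteness, the TWA_SIGNAL_NO_IPI change looks schematically like the
following; the helper and variable names here are placeholders rather than the
actual io_uring call site:

#include <linux/task_work.h>

/* Schematic only: wherever io_uring currently queues task work with
 * TWA_SIGNAL (which kicks the target task with an IPI), pass
 * TWA_SIGNAL_NO_IPI so the work is queued without sending the IPI. */
static int queue_tw_without_ipi(struct task_struct *task,
				struct callback_head *work)
{
	return task_work_add(task, work, TWA_SIGNAL_NO_IPI);	/* was TWA_SIGNAL */
}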
Nicholas Piggin Nov. 15, 2023, 3:11 a.m. UTC | #3
(Sorry I didn't see that Michael already made the same comment,
ignore my previous.)

On Wed Nov 15, 2023 at 7:32 AM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe@ellerman.id.au>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Cc: "Jens Axboe" <axboe@kernel.dk>
> > Sent: Tuesday, November 14, 2023 6:14:37 AM
> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>
> > Hi Timothy,
> > 
> > Thanks for debugging this, but I'm unclear why this is helping because
> > we should already have a full barrier (hwsync) on both the sending and
> > receiving side.
> > 
> > More below.
>
> So it looks like we might be dealing with a couple of potentially separate issues here, not sure.  This is probably the most infuriating bug I've run across in my career, so please bear with me -- with the number of active test kernels I have installed at any one time I'm occasionally infecting one with the tests from another. ;)
>
> First, I'm not 100% convinced the code in try_to_wake_up() is race-free on ppc64, since adding a memory barrier between it and the call to kick_process() significantly reduces the frequency of the on-disk corruption.  Can you take a look and see if anything stands out as to why that would be important?

We have had memory ordering bugs in the scheduler before, but normally
they would result in a lost wakeup (hang) or crash in the scheduler.
AFAIK things that sleep are not supposed to assume they won't get an
interrupt before the event they are waiting on.

Where are you adding the barrier?

If it only reduces corruption then it seems like the race or ordering
bug is elsewhere (but maybe nearby).

>
> The second part of this though is that the barrier only reduces the frequency of the corruption, it does not eliminate the corruption.  After some consolidation, what completely eliminates the corruption is a combination of:
>  * Switching to TWA_SIGNAL_NO_IPI in task_work_add() [1]
>  * Adding udelay(1000) in io_uring/rw.c:io_write(), right before the call to io_rw_init_file()

>
> Adding a memory barrier instead of the udelay() doesn't work, nor does adding the delay without switching to NO_IPI.

And just NO_IPI alone doesn't eliminate it?

>
> [1] Keeping TWA_SIGNAL and adding the barrier instruction also works, but this is conceptually simpler to understand as to why it would have an effect at all

Which barrier, the ttwu one? And by work you mean fixes completely?

Still smells like a bug in io uring code, possibly MySQL. Is it only
MySQL it's been seen with, do you know? I don't suppose there are any
clues about the corruption pattern or when or where it appears?

Thanks,
Nick
Timothy Pearson Nov. 17, 2023, 7:39 a.m. UTC | #4
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Cc: "Jens Axboe" <axboe@kernel.dk>
> Sent: Tuesday, November 14, 2023 6:14:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> Hi Timothy,
> 
> Thanks for debugging this, but I'm unclear why this is helping because
> we should already have a full barrier (hwsync) on both the sending and
> receiving side.
> 
> More below.

I've spent another few days poking at this, and think I might finally have something more solid in terms of what exactly is happening, but would like some feedback on the concept / how best to fix the potential problem.

As background, there are several worker threads both in userspace and in kernel mode.  Crucially, the main MariaDB data processing thread (the one that handles tasks like flushing dirty pages to disk) always runs on the same core as the io_uring kernel thread that picks up I/O worker creation requests and handles them via create_worker_cb().

Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring currently sets up the new process with create_io_thread(), and immediately uses an IPI to forcibly schedule the new process.  Because of the way the two threads interact, the new process ends up grabbing the CPU from the running MariaDB user thread; I've never seen it schedule on a different core.  If the timing is right in this process, things get trampled on in userspace and the database server either crashes or throws a corruption fault.
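
Schematically, the path I am describing looks something like this (shape
assumed from my reading of io_uring/io-wq.c; the real code differs in detail
and by kernel version, so treat it as an illustration only):

/* Assumed shape of the worker-creation path, not verbatim io_uring code:
 * the new PF_IO_WORKER task is created and then immediately woken, which
 * can send a reschedule IPI to the CPU running the submitting user thread. */
static void create_io_worker_sketch(void *worker_data)
{
	struct task_struct *tsk;

	tsk = create_io_thread(io_wq_worker, worker_data, NUMA_NO_NODE);
	if (!IS_ERR(tsk))
		wake_up_new_task(tsk);	/* may preempt the submitter's CPU */
}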

Through extensive debugging, I've narrowed this down to invalid state in the VSX registers on return to the MariaDB user thread from the new kernel thread.  For some reason, it seems we don't restore FP state on return from the PF_IO_WORKER thread, and something in the kernel was busy writing new data to them.

A direct example I was able to observe is as follows:

xxspltd vs0,vs0,0      <-- vs0 now zeroed out
xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here due to the IPI
slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the value 0x820040000000000082004000
xxswapd vs8,vs0        <-- vs8 now has the wrong value
stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
stw     r9,116(r3)
stxvd2x vs8,r3,r0
...
CRASH

This is a very difficult race to hit, but MariaDB naturally does repetitive operations with VSX registers so it does eventually fail.  I ended up with a tight loop around glibc operations that use VSX to trigger the failure reliably enough to even understand what was going on.
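
As an illustration of that kind of tight loop (my own sketch of the approach,
not the actual reproducer used here), something like the following run
alongside an io_uring-heavy workload will eventually trip over corrupted vs*
state, since glibc's memcpy() on ppc64le is implemented with VSX loads and
stores:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	static uint8_t src[65536], dst[65536];
	unsigned long iter;

	memset(src, 0xa5, sizeof(src));
	for (iter = 0; ; iter++) {
		memcpy(dst, src, sizeof(dst));		/* VSX-accelerated on ppc64le */
		if (memcmp(dst, src, sizeof(dst)) != 0) {
			fprintf(stderr, "corruption after %lu iterations\n", iter);
			abort();
		}
	}
}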

As I am not as familiar with this part of the Linux kernel as with most other areas, what is the expected save/restore path for the FP/VSX registers around an IPI and associated forced thread switch?  If restore_math() is in the return path, note that MSR_FP is set in regs->msr.

Second question: should we even be using the VSX registers at all in kernel space?  Is this a side effect of io_uring interacting so closely with userspace threads, or something else entirely?

If I can get pointed somewhat in the right direction I'm ready to develop the rest of the fix for this issue...just trying to avoid another several days of slogging through the source to see what it's supposed to be doing in the first place. :)

Thanks!
Timothy Pearson Nov. 17, 2023, 7:52 a.m. UTC | #5
----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineeringinc.com>
> To: "Michael Ellerman" <mpe@ellerman.id.au>, "npiggin" <npiggin@gmail.com>
> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, November 17, 2023 1:39:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> ----- Original Message -----
>> From: "Michael Ellerman" <mpe@ellerman.id.au>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev"
>> <linuxppc-dev@lists.ozlabs.org>
>> Cc: "Jens Axboe" <axboe@kernel.dk>
>> Sent: Tuesday, November 14, 2023 6:14:37 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> 
>> Hi Timothy,
>> 
>> Thanks for debugging this, but I'm unclear why this is helping because
>> we should already have a full barrier (hwsync) on both the sending and
>> receiving side.
>> 
>> More below.
> 
> I've spent another few days poking at this, and think I might finally have
> something more solid in terms of what exactly is happening, but would like some
> feedback on the concept / how best to fix the potential problem.
> 
> As background, there are several worker threads both in userspace and in kernel
> mode.  Crucially, the main MariaDB data processing thread (the one that handles
> tasks like flushing dirty pages to disk) always runs on the same core as the
> io_uring kernel thread that picks up I/O worker creation requests and handles
> them via create_worker_cb().
> 
> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
> currently sets up the new process with create_io_thread(), and immediately uses
> an IPI to forcibly schedule the new process.  Because of the way the two
> threads interact, the new process ends up grabbing the CPU from the running
> MariaDB user thread; I've never seen it schedule on a different core.  If the
> timing is right in this process, things get trampled on in userspace and the
> database server either crashes or throws a corruption fault.
> 
> Through extensive debugging, I've narrowed this down to invalid state in the VSX
> registers on return to the MariaDB user thread from the new kernel thread.  For
> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
> thread, and something in the kernel was busy writing new data to them.
> 
> A direct example I was able to observe is as follows:
> 
> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
> due to the IPI
> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
> value 0x820040000000000082004000
> xxswapd vs8,vs0        <-- vs8 now has the wrong value
> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
> stw     r9,116(r3)
> stxvd2x vs8,r3,r0
> ...
> CRASH
> 
> This is a very difficult race to hit, but MariaDB naturally does repetitive
> operations with VSX registers so it does eventually fail.  I ended up with a
> tight loop around glibc operations that use VSX to trigger the failure reliably
> enough to even understand what was going on.
> 
> As I am not as familiar with this part of the Linux kernel as with most other
> areas, what is the expected save/restore path for the FP/VSX registers around
> an IPI and associated forced thread switch?  If restore_math() is in the return
> path, note that MSR_FP is set in regs->msr.
> 
> Second question: should we even be using the VSX registers at all in kernel
> space?  Is this a side effect of io_uring interacting so closely with userspace
> threads, or something else entirely?

Thinking a bit more, a third option could be that we're restoring garbage into the registers by accident.  I know the I/O worker threads claim to use a "lightweight" version of kernel_clone(), and it'd be easy to have missed something important...
Nicholas Piggin Nov. 17, 2023, 8:01 a.m. UTC | #6
On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe@ellerman.id.au>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Cc: "Jens Axboe" <axboe@kernel.dk>
> > Sent: Tuesday, November 14, 2023 6:14:37 AM
> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>
> > Hi Timothy,
> > 
> > Thanks for debugging this, but I'm unclear why this is helping because
> > we should already have a full barrier (hwsync) on both the sending and
> > receiving side.
> > 
> > More below.
>
> I've spent another few days poking at this, and think I might finally have something more solid in terms of what exactly is happening, but would like some feedback on the concept / how best to fix the potential problem.
>
> As background, there are several worker threads both in userspace and in kernel mode.  Crucially, the main MariaDB data processing thread (the one that handles tasks like flushing dirty pages to disk) always runs on the same core as the io_uring kernel thread that picks up I/O worker creation requests and handles them via create_worker_cb().
>
> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring currently sets up the new process with create_io_thread(), and immediately uses an IPI to forcibly schedule the new process.  Because of the way the two threads interact, the new process ends up grabbing the CPU from the running MariaDB user thread; I've never seen it schedule on a different core.  If the timing is right in this process, things get trampled on in userspace and the database server either crashes or throws a corruption fault.
>
> Through extensive debugging, I've narrowed this down to invalid state in the VSX registers on return to the MariaDB user thread from the new kernel thread.  For some reason, it seems we don't restore FP state on return from the PF_IO_WORKER thread, and something in the kernel was busy writing new data to them.
>
> A direct example I was able to observe is as follows:
>
> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here due to the IPI
> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the value 0x820040000000000082004000
> xxswapd vs8,vs0        <-- vs8 now has the wrong value
> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
> stw     r9,116(r3)
> stxvd2x vs8,r3,r0
> ...
> CRASH

Nice find, that looks pretty conclusive.

> This is a very difficult race to hit, but MariaDB naturally does repetitive operations with VSX registers so it does eventually fail.  I ended up with a tight loop around glibc operations that use VSX to trigger the failure reliably enough to even understand what was going on.
>
> As I am not as familiar with this part of the Linux kernel as with most other areas, what is the expected save/restore path for the FP/VSX registers around an IPI and associated forced thread switch?  If restore_math() is in the return path, note that MSR_FP is set in regs->msr.

Context switching these FP/vec registers should be pretty robust in
general because it's not just io-uring that uses them. io-uring could
be using some uncommon kernel code that uses the registers incorrectly
though I guess.

>
> Second question: should we even be using the VSX registers at all in kernel space?  Is this a side effect of io_uring interacting so closely with userspace threads, or something else entirely?
>
> If I can get pointed somewhat in the right direction I'm ready to develop the rest of the fix for this issue...just trying to avoid another several days of slogging through the source to see what it's supposed to be doing in the first place. :)

Kernel can use FP/VEC/VSX registers but it has to enable and disable
explicitly. Such kernel code also should not be preemptible.

enable|disable_kernel_fp|altivec|vsx().

Maybe try running the test with the ppc_strict_facility_enable boot option to
start with.

If no luck with that, maybe put WARN_ON(preemptible()); checks also in
the disable_kernel_* functions.
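
As a sketch, the check could go directly into the helpers in
arch/powerpc/include/asm/switch_to.h; the WARN_ON() line is the proposed
debugging addition, and the rest only approximates the existing helper:

/* Debugging sketch, not a final patch */
static inline void disable_kernel_fp(void)
{
	WARN_ON(preemptible());		/* kernel FP sections must not be preemptible */
	msr_check_and_clear(MSR_FP);
}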

You could also add an enable/disable counter for each, and make sure it
is balanced on context switch or kernel->userspace exit.

Thanks,
Nick
Timothy Pearson Nov. 17, 2023, 8:20 a.m. UTC | #7
----- Original Message -----
> From: "npiggin" <npiggin@gmail.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "Michael Ellerman" <mpe@ellerman.id.au>
> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:01:12 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>> > From: "Michael Ellerman" <mpe@ellerman.id.au>
>> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev"
>> > <linuxppc-dev@lists.ozlabs.org>
>> > Cc: "Jens Axboe" <axboe@kernel.dk>
>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>
>> > Hi Timothy,
>> > 
>> > Thanks for debugging this, but I'm unclear why this is helping because
>> > we should already have a full barrier (hwsync) on both the sending and
>> > receiving side.
>> > 
>> > More below.
>>
>> I've spent another few days poking at this, and think I might finally have
>> something more solid in terms of what exactly is happening, but would like some
>> feedback on the concept / how best to fix the potential problem.
>>
>> As background, there are several worker threads both in userspace and in kernel
>> mode.  Crucially, the main MariaDB data processing thread (the one that handles
>> tasks like flushing dirty pages to disk) always runs on the same core as the
>> io_uring kernel thread that picks up I/O worker creation requests and handles
>> them via create_worker_cb().
>>
>> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
>> currently sets up the new process with create_io_thread(), and immediately uses
>> an IPI to forcibly schedule the new process.  Because of the way the two
>> threads interact, the new process ends up grabbing the CPU from the running
>> MariaDB user thread; I've never seen it schedule on a different core.  If the
>> timing is right in this process, things get trampled on in userspace and the
>> database server either crashes or throws a corruption fault.
>>
>> Through extensive debugging, I've narrowed this down to invalid state in the VSX
>> registers on return to the MariaDB user thread from the new kernel thread.  For
>> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
>> thread, and something in the kernel was busy writing new data to them.
>>
>> A direct example I was able to observe is as follows:
>>
>> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
>> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
>> due to the IPI
>> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
>> value 0x820040000000000082004000
>> xxswapd vs8,vs0        <-- vs8 now has the wrong value
>> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
>> stw     r9,116(r3)
>> stxvd2x vs8,r3,r0
>> ...
>> CRASH
> 
> Nice find, that looks pretty conclusive.
> 
>> This is a very difficult race to hit, but MariaDB naturally does repetitive
>> operations with VSX registers so it does eventually fail.  I ended up with a
>> tight loop around glibc operations that use VSX to trigger the failure reliably
>> enough to even understand what was going on.
>>
>> As I am not as familiar with this part of the Linux kernel as with most other
>> areas, what is the expected save/restore path for the FP/VSX registers around
>> an IPI and associated forced thread switch?  If restore_math() is in the return
>> path, note that MSR_FP is set in regs->msr.
> 
> Context switching these FP/vec registers should be pretty robust in
> general because it's not just io-uring that uses them. io-uring could
> be using some uncommon kernel code that uses the registers incorrectly
> though I guess.
> 
>>
>> Second question: should we even be using the VSX registers at all in kernel
>> space?  Is this a side effect of io_uring interacting so closely with userspace
>> threads, or something else entirely?
>>
>> If I can get pointed somewhat in the right direction I'm ready to develop the
>> rest of the fix for this issue...just trying to avoid another several days of
>> slogging through the source to see what it's supposed to be doing in the first
>> place. :)
> 
> Kernel can use FP/VEC/VSX registers but it has to enable and disable
> explicitly. Such kernel code also should not be preemptible.
> 
> enable|disable_kernel_fp|altivec|vsx().
> 
> Maybe try run the test with ppc_strict_facility_enable boot option to
> start with.
> 
> If no luck with that, maybe put WARN_ON(preemptible()); checks also in
> the disable_kernel_* functions.
> 
> You could also add an enable/disable counter for each, and make sure it
> is balanced on context switch or kernel->userspace exit.
> 
> Thanks,
> Nick

Will do, thanks for the hints!

I had a debug idea just as I sent the earlier message, and was able to confirm the kernel is purposefully restoring the bad data in at least one spot in the thread's history, though curiously *not* right before everything goes off the rails.  I also dumped the entire kernel binary and confirmed it isn't touching the vs* registers, so overall this is leaning more toward a bad value being restored than kernel code inadvertently making use of the vector registers.

Am I missing anything other than do_restore_fp() that could touch the vs* registers around an interrupt-driven task switch?
Timothy Pearson Nov. 17, 2023, 8:26 a.m. UTC | #8
----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "npiggin" <npiggin@gmail.com>
> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:20:29 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> ----- Original Message -----
>> From: "npiggin" <npiggin@gmail.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "Michael Ellerman"
>> <mpe@ellerman.id.au>
>> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>> Sent: Friday, November 17, 2023 2:01:12 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> 
>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>> > From: "Michael Ellerman" <mpe@ellerman.id.au>
>>> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev"
>>> > <linuxppc-dev@lists.ozlabs.org>
>>> > Cc: "Jens Axboe" <axboe@kernel.dk>
>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>>
>>> > Hi Timothy,
>>> > 
>>> > Thanks for debugging this, but I'm unclear why this is helping because
>>> > we should already have a full barrier (hwsync) on both the sending and
>>> > receiving side.
>>> > 
>>> > More below.
>>>
>>> I've spent another few days poking at this, and think I might finally have
>>> something more solid in terms of what exactly is happening, but would like some
>>> feedback on the concept / how best to fix the potential problem.
>>>
>>> As background, there are several worker threads both in userspace and in kernel
>>> mode.  Crucially, the main MariaDB data processing thread (the one that handles
>>> tasks like flushing dirty pages to disk) always runs on the same core as the
>>> io_uring kernel thread that picks up I/O worker creation requests and handles
>>> them via create_worker_cb().
>>>
>>> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
>>> currently sets up the new process with create_io_thread(), and immediately uses
>>> an IPI to forcibly schedule the new process.  Because of the way the two
>>> threads interact, the new process ends up grabbing the CPU from the running
>>> MariaDB user thread; I've never seen it schedule on a different core.  If the
>>> timing is right in this process, things get trampled on in userspace and the
>>> database server either crashes or throws a corruption fault.
>>>
>>> Through extensive debugging, I've narrowed this down to invalid state in the VSX
>>> registers on return to the MariaDB user thread from the new kernel thread.  For
>>> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
>>> thread, and something in the kernel was busy writing new data to them.
>>>
>>> A direct example I was able to observe is as follows:
>>>
>>> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
>>> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
>>> due to the IPI
>>> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
>>> value 0x820040000000000082004000
>>> xxswapd vs8,vs0        <-- vs8 now has the wrong value
>>> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
>>> stw     r9,116(r3)
>>> stxvd2x vs8,r3,r0
>>> ...
>>> CRASH
>> 
>> Nice find, that looks pretty conclusive.
>> 
>>> This is a very difficult race to hit, but MariaDB naturally does repetitive
>>> operations with VSX registers so it does eventually fail.  I ended up with a
>>> tight loop around glibc operations that use VSX to trigger the failure reliably
>>> enough to even understand what was going on.
>>>
>>> As I am not as familiar with this part of the Linux kernel as with most other
>>> areas, what is the expected save/restore path for the FP/VSX registers around
>>> an IPI and associated forced thread switch?  If restore_math() is in the return
>>> path, note that MSR_FP is set in regs->msr.
>> 
>> Context switching these FP/vec registers should be pretty robust in
>> general because it's not just io-uring that uses them. io-uring could
>> be using some uncommon kernel code that uses the registers incorrectly
>> though I guess.
>> 
>>>
>>> Second question: should we even be using the VSX registers at all in kernel
>>> space?  Is this a side effect of io_uring interacting so closely with userspace
>>> threads, or something else entirely?
>>>
>>> If I can get pointed somewhat in the right direction I'm ready to develop the
>>> rest of the fix for this issue...just trying to avoid another several days of
>>> slogging through the source to see what it's supposed to be doing in the first
>>> place. :)
>> 
>> Kernel can use FP/VEC/VSX registers but it has to enable and disable
>> explicitly. Such kernel code also should not be preemptible.
>> 
>> enable|disable_kernel_fp|altivec|vsx().
>> 
>> Maybe try run the test with ppc_strict_facility_enable boot option to
>> start with.
>> 
>> If no luck with that, maybe put WARN_ON(preemptible()); checks also in
>> the disable_kernel_* functions.
>> 
>> You could also add an enable/disable counter for each, and make sure it
>> is balanced on context switch or kernel->userspace exit.
>> 
>> Thanks,
>> Nick
> 
> Will do, thanks for the hints!
> 
> I had a debug idea just as I sent the earlier message, and was able to confirm
> the kernel is purposefully restoring the bad data in at least one spot in the
> thread's history, though curiously *not* right before everything goes off the
> rails.  I also dumped the entire kernel binary and confirmed it isn't touching
> the vs* registers, so overall this is leaning more toward a bad value being
> restored than kernel code inadvertently making use of the vector registers.
> 
> Am I missing anything other than do_restore_fp() that could touch the vs*
> registers around an interrupt-driven task switch?

One other piece of this puzzle -- I'm running this via qemu in kvm mode.  I noticed there is some code that touches FP state in the kvm tree; is there any way we could be having a problem lower down the stack (i.e. at the hypervisor level) that would manifest this way in the guest?

I can try to test on bare metal tomorrow to rule that out one way or the other.
Timothy Pearson Nov. 17, 2023, 8:54 a.m. UTC | #9
----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "npiggin" <npiggin@gmail.com>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:26:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson@raptorengineering.com>
>> To: "npiggin" <npiggin@gmail.com>
>> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>> Sent: Friday, November 17, 2023 2:20:29 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> 
>> ----- Original Message -----
>>> From: "npiggin" <npiggin@gmail.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "Michael Ellerman"
>>> <mpe@ellerman.id.au>
>>> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>>> Sent: Friday, November 17, 2023 2:01:12 AM
>>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>> 
>>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>> > From: "Michael Ellerman" <mpe@ellerman.id.au>
>>>> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linuxppc-dev"
>>>> > <linuxppc-dev@lists.ozlabs.org>
>>>> > Cc: "Jens Axboe" <axboe@kernel.dk>
>>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>>>
>>>> > Hi Timothy,
>>>> > 
>>>> > Thanks for debugging this, but I'm unclear why this is helping because
>>>> > we should already have a full barrier (hwsync) on both the sending and
>>>> > receiving side.
>>>> > 
>>>> > More below.
>>>>
>>>> I've spent another few days poking at this, and think I might finally have
>>>> something more solid in terms of what exactly is happening, but would like some
>>>> feedback on the concept / how best to fix the potential problem.
>>>>
>>>> As background, there are several worker threads both in userspace and in kernel
>>>> mode.  Crucially, the main MariaDB data processing thread (the one that handles
>>>> tasks like flushing dirty pages to disk) always runs on the same core as the
>>>> io_uring kernel thread that picks up I/O worker creation requests and handles
>>>> them via create_worker_cb().
>>>>
>>>> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
>>>> currently sets up the new process with create_io_thread(), and immediately uses
>>>> an IPI to forcibly schedule the new process.  Because of the way the two
>>>> threads interact, the new process ends up grabbing the CPU from the running
>>>> MariaDB user thread; I've never seen it schedule on a different core.  If the
>>>> timing is right in this process, things get trampled on in userspace and the
>>>> database server either crashes or throws a corruption fault.
>>>>
>>>> Through extensive debugging, I've narrowed this down to invalid state in the VSX
>>>> registers on return to the MariaDB user thread from the new kernel thread.  For
>>>> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
>>>> thread, and something in the kernel was busy writing new data to them.
>>>>
>>>> A direct example I was able to observe is as follows:
>>>>
>>>> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
>>>> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
>>>> due to the IPI
>>>> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
>>>> value 0x820040000000000082004000
>>>> xxswapd vs8,vs0        <-- vs8 now has the wrong value
>>>> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
>>>> stw     r9,116(r3)
>>>> stxvd2x vs8,r3,r0
>>>> ...
>>>> CRASH
>>> 
>>> Nice find, that looks pretty conclusive.
>>> 
>>>> This is a very difficult race to hit, but MariaDB naturally does repetitive
>>>> operations with VSX registers so it does eventually fail.  I ended up with a
>>>> tight loop around glibc operations that use VSX to trigger the failure reliably
>>>> enough to even understand what was going on.
>>>>
>>>> As I am not as familiar with this part of the Linux kernel as with most other
>>>> areas, what is the expected save/restore path for the FP/VSX registers around
>>>> an IPI and associated forced thread switch?  If restore_math() is in the return
>>>> path, note that MSR_FP is set in regs->msr.
>>> 
>>> Context switching these FP/vec registers should be pretty robust in
>>> general because it's not just io-uring that uses them. io-uring could
>>> be using some uncommon kernel code that uses the registers incorrectly
>>> though I guess.
>>> 
>>>>
>>>> Second question: should we even be using the VSX registers at all in kernel
>>>> space?  Is this a side effect of io_uring interacting so closely with userspace
>>>> threads, or something else entirely?
>>>>
>>>> If I can get pointed somewhat in the right direction I'm ready to develop the
>>>> rest of the fix for this issue...just trying to avoid another several days of
>>>> slogging through the source to see what it's supposed to be doing in the first
>>>> place. :)
>>> 
>>> Kernel can use FP/VEC/VSX registers but it has to enable and disable
>>> explicitly. Such kernel code also should not be preemptible.
>>> 
>>> enable|disable_kernel_fp|altivec|vsx().
>>> 
>>> Maybe try run the test with ppc_strict_facility_enable boot option to
>>> start with.
>>> 
>>> If no luck with that, maybe put WARN_ON(preemptible()); checks also in
>>> the disable_kernel_* functions.
>>> 
>>> You could also add an enable/disable counter for each, and make sure it
>>> is balanced on context switch or kernel->userspace exit.
>>> 
>>> Thanks,
>>> Nick
>> 
>> Will do, thanks for the hints!
>> 
>> I had a debug idea just as I sent the earlier message, and was able to confirm
>> the kernel is purposefully restoring the bad data in at least one spot in the
>> thread's history, though curiously *not* right before everything goes off the
>> rails.  I also dumped the entire kernel binary and confirmed it isn't touching
>> the vs* registers, so overall this is leaning more toward a bad value being
>> restored than kernel code inadvertently making use of the vector registers.
>> 
>> Am I missing anything other than do_restore_fp() that could touch the vs*
>> registers around an interrupt-driven task switch?
> 
> One other piece of this puzzle -- I'm running this via qemu in kvm mode.  I
> noticed there is some code that touches FP state in the kvm tree, any way we
> could be having a problem lower down the stack (i.e. at hypervisor level) that
> would manifest this way in the guest?
> 
> I can try to test on bare metal tomorrow to rule that out one way or the other.

Just tried on bare metal; the same corruption issues remain, so I'm going to assume qemu/kvm is not the issue here.

Patch

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ab691c89d787..ba42238de518 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -369,8 +369,10 @@  static inline void do_message_pass(int cpu, int msg)
 
 void arch_smp_send_reschedule(int cpu)
 {
-	if (likely(smp_ops))
+	if (likely(smp_ops)) {
+		__smp_lwsync();
 		do_message_pass(cpu, PPC_MSG_RESCHEDULE);
+	}
 }
 EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);