diff mbox

powerpc: Don't clear larx reservation on system call exit

Message ID 20100215014028.GB13355@kryten (mailing list archive)
State Changes Requested
Delegated to: Benjamin Herrenschmidt
Headers show

Commit Message

Anton Blanchard Feb. 15, 2010, 1:40 a.m. UTC
Right now we clear the larx reservation on every system call exit. No code
should exist that tries to make use of larx/stcx across a system call (because
it would fail 100% of the time).

We could continue to play it safe but system call latency affects so many
workloads. In the past we have already cut down the set of registers we
save and restore across a system call and this could be seen as an
extension of that. The PowerPC system call ABI does not (and could not)
preserve a larx reservation.

On POWER6 the poster child for system call improvements, getppid, improves 6%.

A more useful test is the private futex wake system call and that improves 5%.
This is a decent speedup on an important system call for threaded applications.

Signed-off-by: Anton Blanchard <anton@samba.org>
---
If my previous patches didn't worry you then this one is sure to.

Getting this wrong will make someone's life miserable, so it could do with
some double checking (eg we don't branch through there on other exceptions and
we dont invoke system calls from the kernel that rely on the reservation being
cleared).

Comments

Benjamin Herrenschmidt Feb. 15, 2010, 2:24 a.m. UTC | #1
On Mon, 2010-02-15 at 12:40 +1100, Anton Blanchard wrote:
> Right now we clear the larx reservation on every system call exit. No code
> should exist that tries to make use of larx/stcx across a system call (because
> it would fail 100% of the time).
> 
> We could continue to play it safe but system call latency affects so many
> workloads. In the past we have already cut down the set of registers we
> save and restore across a system call and this could be seen as an
> extension of that. The PowerPC system call ABI does not (and could not)
> preserve a larx reservation.
> 
> On POWER6 the poster child for system call improvements, getppid, improves 6%.
> 
> A more useful test is the private futex wake system call and that improves 5%.
> This is a decent speedup on an important system call for threaded applications.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
> If my previous patches didn't worry you then this one is sure to.
> 
> Getting this wrong will make someone's life miserable, so it could do with
> some double checking (eg we don't branch through there on other exceptions and
> we dont invoke system calls from the kernel that rely on the reservation being
> cleared).

Well, the main issue here is leaking kernel reservations into userspace,
and thus the question of whether it is a big deal or not. There's also
an issue I can see with signals.

The risk with kernel reservations leaking into userspace is a problem on
some processors that do not compare the reservation address locally
(only for snoops), thus userspace code doing lwarx/syscall/stwcx. might
end up with a suceeding stwcx. despite the fact that the original
reservation was long lost. 

At this stage it becomes an ABI problem, ie, whether we define the
behaviour of a lwarx/stwcx. accross a syscall as defined or not.

The other problem I see is that signal handlers would have to be made
very careful not to leave dangling reservations since the return from
the syscall is a syscall, unless we add code specifically to this (and
set_context too I'd say) to clear reservations.

IE. You could have something like:

lwarx, <interrupt>, signal handler, sigreturn, stwcx.

In the above case, the reservation would be cleared by the return from
the interrupt, but the signal handler might leave a dangling one, which
sigreturn might fail to clear (in practice, our current implementation
of sys_sigreturn() will probably clear any reservation as a side effect
of restore_sigmask() spinlock or set_thread_flag() but it sounds a bit
fragile to rely on unless it's well documented). 

Cheers,
Ben.

> Index: powerpc.git/arch/powerpc/kernel/entry_64.S
> ===================================================================
> --- powerpc.git.orig/arch/powerpc/kernel/entry_64.S	2010-02-13 16:26:43.794322638 +1100
> +++ powerpc.git/arch/powerpc/kernel/entry_64.S	2010-02-13 16:27:03.205575405 +1100
> @@ -202,7 +202,6 @@ syscall_exit:
>  	bge-	syscall_error
>  syscall_error_cont:
>  	ld	r7,_NIP(r1)
> -	stdcx.	r0,0,r1			/* to clear the reservation */
>  	andi.	r6,r8,MSR_PR
>  	ld	r4,_LINK(r1)
>  	/*
Anton Blanchard Feb. 15, 2010, 4:06 a.m. UTC | #2
Hi Ben,

> Well, the main issue here is leaking kernel reservations into userspace,
> and thus the question of whether it is a big deal or not. There's also
> an issue I can see with signals.
> 
> The risk with kernel reservations leaking into userspace is a problem on
> some processors that do not compare the reservation address locally
> (only for snoops), thus userspace code doing lwarx/syscall/stwcx. might
> end up with a suceeding stwcx. despite the fact that the original
> reservation was long lost. 

Yeah that was my primary concern. Right now these things fail 100%, so
no one is relying on it. The worry is if people start writing their own
crazy low level system call + locking stubs that might work most of the
time (if we remove the stwcx in syscall exit).

> At this stage it becomes an ABI problem, ie, whether we define the
> behaviour of a lwarx/stwcx. accross a syscall as defined or not.
> 
> The other problem I see is that signal handlers would have to be made
> very careful not to leave dangling reservations since the return from
> the syscall is a syscall, unless we add code specifically to this (and
> set_context too I'd say) to clear reservations.
> 
> IE. You could have something like:
> 
> lwarx, <interrupt>, signal handler, sigreturn, stwcx.
> 
> In the above case, the reservation would be cleared by the return from
> the interrupt, but the signal handler might leave a dangling one, which
> sigreturn might fail to clear (in practice, our current implementation
> of sys_sigreturn() will probably clear any reservation as a side effect
> of restore_sigmask() spinlock or set_thread_flag() but it sounds a bit
> fragile to rely on unless it's well documented). 

Good point, I hadn't thought of signals and I agree we'd need to clear the
reservation in the sigreturn path.

Anton
Benjamin Herrenschmidt Feb. 15, 2010, 4:15 a.m. UTC | #3
On Mon, 2010-02-15 at 15:06 +1100, Anton Blanchard wrote:
> 
> Yeah that was my primary concern. Right now these things fail 100%, so
> no one is relying on it. The worry is if people start writing their own
> crazy low level system call + locking stubs that might work most of the
> time (if we remove the stwcx in syscall exit).

Worse than that. It will look like it works, but it won't really since
the dangling reservation matches a completely unrelated lwarx done by
the kernel and not the original userspace one, so the stwcx. might
succeed despite the fact that the original value -was- changed.

So it's really a matter of documentation I suppose but we have to be
careful as I've learned with time that there is nothing too silly for
userspace to do :-)

> Good point, I hadn't thought of signals and I agree we'd need to clear the
> reservation in the sigreturn path.

Right. As I said, I'm pretty sure it will happen as a side effect of
other things but I'd rather pay the small price of having an explicit
blurb to do it in sigreturn regardless.

Cheers,
Ben.
diff mbox

Patch

Index: powerpc.git/arch/powerpc/kernel/entry_64.S
===================================================================
--- powerpc.git.orig/arch/powerpc/kernel/entry_64.S	2010-02-13 16:26:43.794322638 +1100
+++ powerpc.git/arch/powerpc/kernel/entry_64.S	2010-02-13 16:27:03.205575405 +1100
@@ -202,7 +202,6 @@  syscall_exit:
 	bge-	syscall_error
 syscall_error_cont:
 	ld	r7,_NIP(r1)
-	stdcx.	r0,0,r1			/* to clear the reservation */
 	andi.	r6,r8,MSR_PR
 	ld	r4,_LINK(r1)
 	/*