Message ID | 1521045375.11552.27.camel@synopsys.com |
---|---|
State | New |
Headers | show |
Series | arc_usr_cmpxchg and preemption | expand |
+CC linux-arch, Peter for any preemption insights ! On 03/14/2018 09:36 AM, Alexey Brodkin wrote: > Hi Vineet, > > While debugging a segfault of user-space app on system without atomic ops > (I mean LLOCK/SCOND) I understood the root-cause is in implementation > of kernel's __NR_arc_usr_cmpxchg syscall which is supposed to emulate mentioned > atomic ops for user-space. > > So here's a problem. > > 1. User-space app [via libc] triggers __NR_arc_usr_cmpxchg syscall, > we enter arc_usr_cmpxchg()which basically does: > ---------------------------->8------------------------------- > preempt_disable(); > __get_user(uval, uaddr); > __put_user(new, uaddr); > preempt_enable(); > ---------------------------->8------------------------------- > > 2. Most of the time everything is fine because __get_user()/__put_user() > for ARC is just LD/ST. > > 3. Rarely user's variable is situated in not yet allocated page. > Here I mean copy-on-write case, when a page has read-only flag in TLB. > In that case __get_user() succeeds but __put_user() causes Privilege > Violation exception and we enter do_page_fault() where new page allocation > with proper access bits is supposed to happen... but that never happens > because with preempt_disable() we set in_atomic() which set > faulthandler_disabled() and so we exit early from page fault handler > effectively with nothing done, i.e. user's variable is left unchanged > which in its turn causes very strange problems later down the line because > we don't notify user-space app about failed data modification. Interesting problem ! But what is special here, I would think syscalls in general could hit this. > > The simplest fix is to not mess with preemption: > ---------------------------->8------------------------------- > diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c > index 5ac3b547453f..d1713d8d3981 100644 > --- a/arch/arc/kernel/process.c > +++ b/arch/arc/kernel/process.c > @@ -63,8 +63,6 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new) > if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int))) > return -EFAULT; > > - preempt_disable(); > - > if (__get_user(uval, uaddr)) > goto done; ... ... > done: > - preempt_enable(); > - > return uval; > } > ---------------------------->8------------------------------- > > But I'm not really sure how safe is that. Well it is broken wrt the semantics the syscall is supposed to provide. Preemption disabling is what prevents a concurrent thread from coming in and modifying the same location (Imagine a variable which is being cmpxchg concurrently by 2 threads). One approach is to do it the MIPS way, emulate the llsc flag - set it under preemption disabled section and clear it in switch_to see arch/mips/kernel/syscall.c
On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote: > Well it is broken wrt the semantics the syscall is supposed to provide. > Preemption disabling is what prevents a concurrent thread from coming in and > modifying the same location (Imagine a variable which is being cmpxchg > concurrently by 2 threads). > > One approach is to do it the MIPS way, emulate the llsc flag - set it under > preemption disabled section and clear it in switch_to *shudder*... just catch the -EFAULT, force the write fault and retry. Something like: int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) { u32 val; int ret; again: ret = 0; preempt_disable(); val = get_user(user_ptr); if (val == old) ret = put_user(new, user_ptr); preempt_enable(); if (ret == -EFAULT) { struct page *page; ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); if (ret < 0) return ret; put_page(page); goto again; } return ret; }
On Wed, Mar 14, 2018 at 06:53:52PM +0100, Peter Zijlstra wrote: > On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote: > > > Well it is broken wrt the semantics the syscall is supposed to provide. > > Preemption disabling is what prevents a concurrent thread from coming in and > > modifying the same location (Imagine a variable which is being cmpxchg > > concurrently by 2 threads). > > > > One approach is to do it the MIPS way, emulate the llsc flag - set it under > > preemption disabled section and clear it in switch_to > > *shudder*... just catch the -EFAULT, force the write fault and retry. > > Something like: > > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) > { > u32 val; > int ret; Also very important: if ((unsigned long)user_ptr & 0x3) return -EINVAL; You must disallow unaligned atomics, otherwise the below can cross a page-boundary (and unaligned atomics are insane in any case). > again: > ret = 0; > > preempt_disable(); > val = get_user(user_ptr); > if (val == old) > ret = put_user(new, user_ptr); > preempt_enable(); > > if (ret == -EFAULT) { > struct page *page; > ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); > if (ret < 0) > return ret; > put_page(page); > goto again; > } > > return ret; > }
Hi Peter, Vineet, On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote: > On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote: > > > Well it is broken wrt the semantics the syscall is supposed to provide. > > Preemption disabling is what prevents a concurrent thread from coming in and > > modifying the same location (Imagine a variable which is being cmpxchg > > concurrently by 2 threads). > > > > One approach is to do it the MIPS way, emulate the llsc flag - set it under > > preemption disabled section and clear it in switch_to > > *shudder*... just catch the -EFAULT, force the write fault and retry. > > Something like: > > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) > { > u32 val; > int ret; > > again: > ret = 0; > > preempt_disable(); > val = get_user(user_ptr); > if (val == old) > ret = put_user(new, user_ptr); > preempt_enable(); > > if (ret == -EFAULT) { > struct page *page; > ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); > if (ret < 0) > return ret; > put_page(page); > goto again; I guess this jump we need to do only once, right? If for whatever reason get_user_pages_fast() fails we return immediately and if it succeeds there's no reason for put_user() to not succeed as required page is supposed to be prepared for write. Otherwise if something goes way too bad we may end-up in an infinite loop which we'd better prevent. > } > > return ret; > } @Vineet, are you OK with proposed implementation? -Alexey
On 03/14/2018 01:38 PM, Alexey Brodkin wrote:
> @Vineet, are you OK with proposed implementation?
I couldn't agree any more !
-Vineet
On Wed, Mar 14, 2018 at 08:38:53PM +0000, Alexey Brodkin wrote: > > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) > > { > > u32 val; > > int ret; > > > > again: > > ret = 0; > > > > preempt_disable(); > > val = get_user(user_ptr); > > if (val == old) > > ret = put_user(new, user_ptr); > > preempt_enable(); > > > > if (ret == -EFAULT) { > > struct page *page; > > ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); > > if (ret < 0) > > return ret; > > put_page(page); > > goto again; > > I guess this jump we need to do only once, right? Typically, yes. It is theoretically possible for the page to get paged-out right after we do put_page() and before we do get/put_user(), But if that happens the machine likely has bigger problems than having to do this loop again. FWIW, look at kernel/futex.c for working examples of this pattern, the above was written purely from memory and could contain a fail or two ;-) Also, it might make sense to stuff this implementation in some lib/ file somewhere and make all platforms that need it use the same code, afaict there really isn't anything platform specific to it.
Hi Peter, On Thu, 2018-03-15 at 09:18 +0100, Peter Zijlstra wrote: > On Wed, Mar 14, 2018 at 08:38:53PM +0000, Alexey Brodkin wrote: > > > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) > > > { > > > u32 val; > > > int ret; > > > > > > again: > > > ret = 0; > > > > > > preempt_disable(); > > > val = get_user(user_ptr); > > > if (val == old) > > > ret = put_user(new, user_ptr); > > > preempt_enable(); > > > > > > if (ret == -EFAULT) { > > > struct page *page; > > > ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); > > > if (ret < 0) > > > return ret; > > > put_page(page); > > > goto again; > > > > I guess this jump we need to do only once, right? > > Typically, yes. It is theoretically possible for the page to get > paged-out right after we do put_page() and before we do get/put_user(), > But if that happens the machine likely has bigger problems than having > to do this loop again. > > FWIW, look at kernel/futex.c for working examples of this pattern, the > above was written purely from memory and could contain a fail or two ;-) Thanks for the pointer! > Also, it might make sense to stuff this implementation in some lib/ file > somewhere and make all platforms that need it use the same code, afaict > there really isn't anything platform specific to it. Not clear which part do you mean here. Are you talking about entire cmpxchg syscall implementation? Do you think there're many users of that quite an inefficient [compared to proper HW version] atomic exchange? -Alexey
On Thu, Mar 15, 2018 at 09:12:09AM +0000, Alexey Brodkin wrote: > On Thu, 2018-03-15 at 09:18 +0100, Peter Zijlstra wrote: > > Also, it might make sense to stuff this implementation in some lib/ file > > somewhere and make all platforms that need it use the same code, afaict > > there really isn't anything platform specific to it. > > Not clear which part do you mean here. > Are you talking about entire cmpxchg syscall implementation? Yep. > Do you think there're many users of that quite an inefficient > [compared to proper HW version] atomic exchange? I think there's a bunch of architectures that are in the same boat. m68k, arm, mips was mentioned. Sure, the moment an arch has hardware support you don't need the syscall anymore. I was just thinking it would be good to have a common implementation (if possible) rather than 4-5 different copies of basically the same thing.
Hi Peter, On Thu, 2018-03-15 at 12:28 +0100, Peter Zijlstra wrote: > On Thu, Mar 15, 2018 at 09:12:09AM +0000, Alexey Brodkin wrote: > > On Thu, 2018-03-15 at 09:18 +0100, Peter Zijlstra wrote: > > > Also, it might make sense to stuff this implementation in some lib/ file > > > somewhere and make all platforms that need it use the same code, afaict > > > there really isn't anything platform specific to it. > > > > Not clear which part do you mean here. > > Are you talking about entire cmpxchg syscall implementation? > > Yep. Hm... new generic syscall doing something sane people are not supposed to do? Let's see who's going to express his/her excitement about that :) But even introduction of that new syscall is obviously not enough as we'll need to fix-up libc for affected arches accordingly... > > Do you think there're many users of that quite an inefficient > > [compared to proper HW version] atomic exchange? > > I think there's a bunch of architectures that are in the same boat. > m68k, arm, mips was mentioned. Sure, the moment an arch has hardware > support you don't need the syscall anymore. Here's a brief analysis: ARM: Looks like they got rid of that stuff in v4.4, see commit db695c0509d6 ("ARM: remove user cmpxchg syscall"). M68K: That's even uglier implementation which is really asking for a facelift, look at sys_atomic_cmpxchg_32() here: https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461 MIPS: They do it via special sysmips syscall which among other things might handle MIPS_ATOMIC_SET with mips_atomic_set() I don't immediately see if there're others but really I'm not sure if it even worth trying to clean-up all that since efforts might be spent pointlessly. > I was just thinking it would be good to have a common implementation (if > possible) rather than 4-5 different copies of basically the same thing. From above I would conclude that only M68K might benefit from new library implementation. BTW M68K uses a bit different ABI compared to ARC for that syscall so it will be really atomic_cmpxchg_32() libfunction called from those syscalls, but now I think that's exactly what you meant initially, correct? -Alexey
On Thu, Mar 15, 2018 at 07:03:32PM +0000, Alexey Brodkin wrote: > > I think there's a bunch of architectures that are in the same boat. > > m68k, arm, mips was mentioned. Sure, the moment an arch has hardware > > support you don't need the syscall anymore. > > Here's a brief analysis: > ARM: Looks like they got rid of that stuff in v4.4, see > commit db695c0509d6 ("ARM: remove user cmpxchg syscall"). Oh shiny, that's why I couldn't find it. I had distinct memories of them having one though. > M68K: That's even uglier implementation which is really asking for > a facelift, look at sys_atomic_cmpxchg_32() here: > https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461 Yes, I found that code, it's something special allright. > MIPS: They do it via special sysmips syscall which among other things > might handle MIPS_ATOMIC_SET with mips_atomic_set() > > I don't immediately see if there're others but really I'm not sure if it even worth trying to > clean-up all that since efforts might be spent pointlessly. > > > I was just thinking it would be good to have a common implementation (if > > possible) rather than 4-5 different copies of basically the same thing. > > From above I would conclude that only M68K might benefit from new > library implementation. BTW M68K uses a bit different ABI compared to > ARC for that syscall so it will be really atomic_cmpxchg_32() > libfunction called from those syscalls, but now I think that's exactly > what you meant initially, correct? Right. In any case, I won't insist, but if it's very little effort, it might well be worth getting rid of that m68k magic.
Hi Peter, Vineet, On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote: > On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote: > > > Well it is broken wrt the semantics the syscall is supposed to provide. > > Preemption disabling is what prevents a concurrent thread from coming in and > > modifying the same location (Imagine a variable which is being cmpxchg > > concurrently by 2 threads). > > > > One approach is to do it the MIPS way, emulate the llsc flag - set it under > > preemption disabled section and clear it in switch_to > > *shudder*... just catch the -EFAULT, force the write fault and retry. More I look at this initially quite simple thing more it looks like a can of worms... > Something like: > > int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) > { That functions is supposed to return old value stored in memory. At least that's how it is used in case of ARC and M68K. Remember there's already libc that relies on that established API and we cannot just change it... even though it might be a good idea. For example return "errno" and pass old value via pointer in an argument. But now I guess it's better to use what we have now. > u32 val; > int ret; > > again: > ret = 0; > > preempt_disable(); > val = get_user(user_ptr); What if get_user() fails? In Peter's implementation we will return 0, in Vineet's we will return -EFAULT... and who knows what kind of unexpected behavior happens further down the line in user-space... so I think it would be safer to kill the process then. And that's my take: -------------------------->8------------------------ int sys_cmpxchg(u32 __user *user_ptr, u32 old, u32 new) { u32 val; int ret; again: ret = 0; preempt_disable(); ret = get_user(val, user_ptr); if(ret == -EFAULT) { struct page *page; preempt_enable(); ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); if (ret < 0) { force_sig(SIGSEGV, current); return ret; } put_page(page); goto again; } if (val == old) ret = put_user(new, user_ptr); preempt_enable(); if (ret == -EFAULT) { struct page *page; ret = get_user_pages_fast((unsigned long)user_ptr, 1, 1, &page); if (ret < 0) { force_sig(SIGSEGV, current); return ret; } put_page(page); goto again; } return ret; } -------------------------->8------------------------ -Alexey
On 03/16/2018 10:33 AM, Alexey Brodkin wrote: > Hi Peter, Vineet, > > On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote: >> On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote: >> >>> Well it is broken wrt the semantics the syscall is supposed to provide. >>> Preemption disabling is what prevents a concurrent thread from coming in and >>> modifying the same location (Imagine a variable which is being cmpxchg >>> concurrently by 2 threads). >>> >>> One approach is to do it the MIPS way, emulate the llsc flag - set it under >>> preemption disabled section and clear it in switch_to >> *shudder*... just catch the -EFAULT, force the write fault and retry. > More I look at this initially quite simple thing more it looks like > a can of worms... > I'd say just bite the bullet, write the patch and we can refine it there ! -Vineet
On Fri, Mar 16, 2018 at 10:54:52AM -0700, Vineet Gupta wrote:
> I'd say just bite the bullet, write the patch and we can refine it there !
Just be glad its not futex.c proper ;-) I'll try and have a look later..
> On Thu, Mar 15, 2018 at 12:03 PM, Alexey Brodkin <Alexey.Brodkin@synopsys.com> wrote: > Here's a brief analysis: > ARM: Looks like they got rid of that stuff in v4.4, see > commit db695c0509d6 ("ARM: remove user cmpxchg syscall"). > > M68K: That's even uglier implementation which is really asking for > a facelift, look at sys_atomic_cmpxchg_32() here: > https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461 > > MIPS: They do it via special sysmips syscall which among other things > might handle MIPS_ATOMIC_SET with mips_atomic_set() > > I don't immediately see if there're others but really I'm not sure if it even worth trying to > clean-up all that since efforts might be spent pointlessly. xtensa is another one. We used to have a buggy implementation in arch/xtensa/kernel/entry.S:fast_syscall_xtensa which we still keep disabled by default, just in case somebody wanted backwards compatibility. I don't think it's worth fixing.
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c index 5ac3b547453f..d1713d8d3981 100644 --- a/arch/arc/kernel/process.c +++ b/arch/arc/kernel/process.c @@ -63,8 +63,6 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new) if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int))) return -EFAULT; - preempt_disable(); - if (__get_user(uval, uaddr)) goto done; @@ -74,8 +72,6 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new) } done: - preempt_enable(); - return uval; } ---------------------------->8-------------------------------