Message ID | 20240830153825.1466691-1-benjamin@sipsolutions.net |
---|---|
State | Changes Requested |
Series | um: add RCU syscall hack for time-travel |
On Fri, Aug 30, 2024 at 5:38 PM Benjamin Berg <benjamin@sipsolutions.net> wrote:
>
> From: Benjamin Berg <benjamin.berg@intel.com>
>
> In time-travel mode userspace can do a lot of work without any time
> passing. Unfortunately, this can result in OOM situations as the RCU
> core code will never be run.
>
> Work around that by kicking the RCU using rcu_sched_clock_irq. So
> behave to the RCU code as if a clock tick happened every syscall.
>
> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
>
> ---
>
> This patch is on top of "um: fix time-travel syscall scheduling hack"
> ---
>  arch/um/kernel/skas/syscall.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/um/kernel/skas/syscall.c b/arch/um/kernel/skas/syscall.c
> index b09e85279d2b..4b4ab8bf8a0c 100644
> --- a/arch/um/kernel/skas/syscall.c
> +++ b/arch/um/kernel/skas/syscall.c
> @@ -19,6 +19,21 @@ void handle_syscall(struct uml_pt_regs *r)
>  	struct pt_regs *regs = container_of(r, struct pt_regs, regs);
>  	int syscall;
>
> +	/*
> +	 * This is a "bit" of a hack. But in time-travel mode userspace can do
> +	 * a lot of work without any time passing. Unfortunately, this can
> +	 * result in OOM situations as the RCU core code will never be run.
> +	 *
> +	 * Work around that by kicking the RCU using rcu_sched_clock_irq. So
> +	 * behave to the RCU code as if a clock tick happened every syscall.
> +	 */
> +	if (time_travel_mode == TT_MODE_INFCPU ||
> +	    time_travel_mode == TT_MODE_EXTERNAL) {
> +		local_irq_disable();
> +		rcu_sched_clock_irq(1);
> +		local_irq_enable();
> +	}
> +

While I acknowledge that time-travel itself is a beautiful hack, I'd like to
keep the hacks to keep it working minimal.
So, the problem here is that RCU callbacks never run and just pile up?
I wonder why such a situation does not happen in a nohz_full setup on regular
systems.
Hi,

On Thu, 2024-09-12 at 21:02 +0200, Richard Weinberger wrote:
> On Fri, Aug 30, 2024 at 5:38 PM Benjamin Berg
> <benjamin@sipsolutions.net> wrote:
> >
> > From: Benjamin Berg <benjamin.berg@intel.com>
> >
> > In time-travel mode userspace can do a lot of work without any time
> > passing. Unfortunately, this can result in OOM situations as the RCU
> > core code will never be run.
> >
> > Work around that by kicking the RCU using rcu_sched_clock_irq. So
> > behave to the RCU code as if a clock tick happened every syscall.
> >
> > Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
> >
> > [SNIP]
>
> While I acknowledge that time-travel itself is a beautiful hack, I'd
> like to keep the hacks to keep it working minimal.
> So, the problem here is that RCU callbacks never run and just pile up?

Yes. A simple example of this is doing a "find /". This will allocate a
lot of inode information which is only free'ed at a later point.

> I wonder why such a situation does not happen in a nohz_full setup on
> regular systems.

Had to search for a bit. But, I think the boot CPU will still have a
tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.

It does look like the RCU code might try to force scheduling (tiny RCU)
or wake up a worker (tree RCU) in these situations. But neither of
these attempts is going to fix the situation as there will be no call
to rcu_sched_clock_irq with time-travel.

Benjamin
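To make the "find /" example concrete, here is a minimal userspace sketch of the failure mode described above. It is a toy model and not kernel code: every name in it (toy_rcu, toy_defer_free, toy_tick, toy_run) is hypothetical, and it assumes, as a simplification, that a single simulated tick is enough to run all pending callbacks. Real RCU has to wait for a grace period first. The only point is that deferred frees accumulate without bound when nothing resembling a tick ever arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy userspace model. None of these names are kernel APIs. */
struct deferred {
	struct deferred *next;
	void *obj;		/* object whose free was deferred */
};

struct toy_rcu {
	struct deferred *head;	/* pending deferred frees */
	size_t pending;		/* current backlog length */
	size_t peak;		/* largest backlog seen so far */
};

/* Queue an object to be freed at the next simulated tick. */
static int toy_defer_free(struct toy_rcu *rcu, void *obj)
{
	struct deferred *d = malloc(sizeof(*d));

	if (!d)
		return -1;
	d->obj = obj;
	d->next = rcu->head;
	rcu->head = d;
	rcu->pending++;
	if (rcu->pending > rcu->peak)
		rcu->peak = rcu->pending;
	return 0;
}

/* Simulated clock tick: run every pending callback (simplification). */
static void toy_tick(struct toy_rcu *rcu)
{
	while (rcu->head) {
		struct deferred *d = rcu->head;

		rcu->head = d->next;
		free(d->obj);
		free(d);
		rcu->pending--;
	}
}

/*
 * Run nsyscalls simulated syscalls. Each one allocates an object and
 * defers its free, like inode information during "find /". With
 * tick_per_syscall set, a tick is simulated after every syscall.
 * Returns the peak backlog of deferred frees.
 */
static size_t toy_run(size_t nsyscalls, int tick_per_syscall)
{
	struct toy_rcu rcu = { 0 };

	for (size_t i = 0; i < nsyscalls; i++) {
		void *obj = malloc(64);

		if (!obj || toy_defer_free(&rcu, obj) != 0) {
			free(obj);
			break;
		}
		if (tick_per_syscall)
			toy_tick(&rcu);
	}
	toy_tick(&rcu);		/* final cleanup so the model does not leak */
	assert(rcu.pending == 0);
	return rcu.peak;
}
```

In this model, toy_run(1000, 0) peaks at a backlog of 1000 entries, while toy_run(1000, 1) never holds more than one. That mirrors the intent of the patch, not its actual behaviour in the kernel.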
----- Original Message -----
> From: "Benjamin Berg" <benjamin@sipsolutions.net>
>> While I acknowledge that time-travel itself is a beautiful hack, I'd
>> like to keep the hacks to keep it working minimal.
>> So, the problem here is that RCU callbacks never run and just pile up?
>
> Yes. A simple example of this is doing a "find /". This will allocate a
> lot of inode information which is only free'ed at a later point.
>
>> I wonder why such a situation does not happen in a nohz_full setup on
>> regular systems.
>
> Had to search for a bit. But, I think the boot CPU will still have a
> tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.
>
> It does look like the RCU code might try to force scheduling (tiny RCU)
> or wake up a worker (tree RCU) in these situations. But neither of
> these attempts is going to fix the situation as there will be no call
> to rcu_sched_clock_irq with time-travel.

Agreed. I think having a house keeping CPU (thread) will not work in
time-travel mode.
Kicking RCU whenever a syscall is executed is okay, the question is,
are there other scenarios where RCU work can pile up and no syscall is
run for a long time? Maybe we need to kick it at other places (page fault
handler?) too.

Thanks,
//richard
Hi

First, it doesn't seem like my patch actually works, so please do not
merge it. It actually appears that tree RCU and tiny RCU (which are
selected depending on the preemption setting) are behaving differently.

So now I am wondering if I can come up with a hack that works for both.

On Fri, 2024-09-13 at 13:47 +0200, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "Benjamin Berg" <benjamin@sipsolutions.net>
> > > While I acknowledge that time-travel itself is a beautiful hack, I'd
> > > like to keep the hacks to keep it working minimal.
> > > So, the problem here is that RCU callbacks never run and just pile up?
> >
> > Yes. A simple example of this is doing a "find /". This will allocate a
> > lot of inode information which is only free'ed at a later point.
> >
> > > I wonder why such a situation does not happen in a nohz_full setup on
> > > regular systems.
> >
> > Had to search for a bit. But, I think the boot CPU will still have a
> > tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.
> >
> > It does look like the RCU code might try to force scheduling (tiny RCU)
> > or wake up a worker (tree RCU) in these situations. But neither of
> > these attempts is going to fix the situation as there will be no call
> > to rcu_sched_clock_irq with time-travel.
>
> Agreed. I think having a house keeping CPU (thread) will not work in
> time-travel mode.
> Kicking RCU whenever a syscall is executed is okay, the question is,
> are there other scenarios where RCU work can pile up and no syscall is
> run for a long time? Maybe we need to kick it at other places (page fault
> handler?) too.

Hmm, that is good question. I assume that implies major faults for
mapped files (or anonymous memory from swap) happening. I suppose, that
can trigger just about anything in the kernel and could also create
load on the RCU. Not sure how problematic that is, in our case it was
python importing a large amount of files and bringing the system to its
knees in the process.

Anyway, I'll need to reconsider the hack a bit, maybe we can find a
better solution.

Benjamin
Hi!

----- Original Message -----
> From: "Benjamin Berg" <benjamin@sipsolutions.net>
> First, it doesn't seem like my patch actually works, so please do not
> merge it. It actually appears that tree RCU and tiny RCU (which are
> selected depending on the preemption setting) are behaving differently.
>
> So now I am wondering if I can come up with a hack that works for both.

Ok!

> On Fri, 2024-09-13 at 13:47 +0200, Richard Weinberger wrote:
>> ----- Original Message -----
>> > From: "Benjamin Berg" <benjamin@sipsolutions.net>
>> > > While I acknowledge that time-travel itself is a beautiful hack, I'd
>> > > like to keep the hacks to keep it working minimal.
>> > > So, the problem here is that RCU callbacks never run and just pile up?
>> >
>> > Yes. A simple example of this is doing a "find /". This will allocate a
>> > lot of inode information which is only free'ed at a later point.
>> >
>> > > I wonder why such a situation does not happen in a nohz_full setup on
>> > > regular systems.
>> >
>> > Had to search for a bit. But, I think the boot CPU will still have a
>> > tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.
>> >
>> > It does look like the RCU code might try to force scheduling (tiny RCU)
>> > or wake up a worker (tree RCU) in these situations. But neither of
>> > these attempts is going to fix the situation as there will be no call
>> > to rcu_sched_clock_irq with time-travel.
>>
>> Agreed. I think having a house keeping CPU (thread) will not work in
>> time-travel mode.
>> Kicking RCU whenever a syscall is executed is okay, the question is,
>> are there other scenarios where RCU work can pile up and no syscall is
>> run for a long time? Maybe we need to kick it at other places (page fault
>> handler?) too.
>
> Hmm, that is good question. I assume that implies major faults for
> mapped files (or anonymous memory from swap) happening. I suppose, that
> can trigger just about anything in the kernel and could also create
> load on the RCU. Not sure how problematic that is, in our case it was
> python importing a large amount of files and bringing the system to its
> knees in the process.

I had also workloads like heavy network processing without userspace
interaction in mind.

> Anyway, I'll need to reconsider the hack a bit, maybe we can find a
> better solution.

We can also add RCU folks into the loop.
But I guess they need a good introduction first what time-traveling is. :-D

Thanks,
//richard
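The concern about kick sites can be illustrated with a second toy model, again in plain userspace C with hypothetical names (toy_event, KICK_ON, toy_peak_backlog). It assumes every event queues exactly one deferred callback and that a kick drains the whole backlog at once. Neither assumption holds for real RCU, and the sketch says nothing about where such kicks could safely be placed in UML.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types for the toy model, not kernel definitions. */
enum toy_event { TOY_SYSCALL, TOY_PAGE_FAULT, TOY_NET_RX };

/* Bit in the kick mask that selects a given event type. */
#define KICK_ON(ev) (1u << (ev))

/*
 * Each event queues one deferred callback. Events whose type is set in
 * kick_mask also drain the whole backlog afterwards. Returns the peak
 * backlog observed over the event trace.
 */
static size_t toy_peak_backlog(const enum toy_event *events, size_t n,
			       unsigned int kick_mask)
{
	size_t pending = 0, peak = 0;

	for (size_t i = 0; i < n; i++) {
		pending++;
		if (pending > peak)
			peak = pending;
		if (kick_mask & KICK_ON(events[i]))
			pending = 0;
	}
	return peak;
}
```

For a trace made up only of TOY_NET_RX events, a mask of KICK_ON(TOY_SYSCALL) never drains anything, so the peak equals the trace length. This is the syscall-free workload raised above.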
diff --git a/arch/um/kernel/skas/syscall.c b/arch/um/kernel/skas/syscall.c
index b09e85279d2b..4b4ab8bf8a0c 100644
--- a/arch/um/kernel/skas/syscall.c
+++ b/arch/um/kernel/skas/syscall.c
@@ -19,6 +19,21 @@ void handle_syscall(struct uml_pt_regs *r)
 	struct pt_regs *regs = container_of(r, struct pt_regs, regs);
 	int syscall;
 
+	/*
+	 * This is a "bit" of a hack. But in time-travel mode userspace can do
+	 * a lot of work without any time passing. Unfortunately, this can
+	 * result in OOM situations as the RCU core code will never be run.
+	 *
+	 * Work around that by kicking the RCU using rcu_sched_clock_irq. So
+	 * behave to the RCU code as if a clock tick happened every syscall.
+	 */
+	if (time_travel_mode == TT_MODE_INFCPU ||
+	    time_travel_mode == TT_MODE_EXTERNAL) {
+		local_irq_disable();
+		rcu_sched_clock_irq(1);
+		local_irq_enable();
+	}
+
 	/* Initialize the syscall number and default return value. */
 	UPT_SYSCALL_NR(r) = PT_SYSCALL_NR(r->gp);
 	PT_REGS_SET_SYSCALL_RETURN(regs, -ENOSYS);