um: add RCU syscall hack for time-travel

Message ID 20240830153825.1466691-1-benjamin@sipsolutions.net
State Changes Requested

Commit Message

Benjamin Berg Aug. 30, 2024, 3:38 p.m. UTC
From: Benjamin Berg <benjamin.berg@intel.com>

In time-travel mode userspace can do a lot of work without any time
passing. Unfortunately, this can result in OOM situations as the RCU
core code will never be run.

Work around that by kicking RCU via rcu_sched_clock_irq(), so that the
RCU code behaves as if a clock tick happened on every syscall.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>

---

This patch is on top of "um: fix time-travel syscall scheduling hack"
---
 arch/um/kernel/skas/syscall.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
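
The failure mode described above can be illustrated with a toy userspace
model. This is only a sketch and not kernel code: the toy_* names are
hypothetical, and "tick" here simply drains the whole queue, whereas real
RCU has to wait for a grace period before running callbacks.

```c
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_rcu {
	size_t pending;	/* callbacks queued but not yet run */
	size_t peak;	/* highest backlog seen so far */
};

/* Queue one deferred free, as a syscall releasing an object might. */
void toy_call_rcu(struct toy_rcu *rcu)
{
	rcu->pending++;
	if (rcu->pending > rcu->peak)
		rcu->peak = rcu->pending;
}

/* Stand-in for the clock tick: assume it lets all queued callbacks run. */
void toy_tick(struct toy_rcu *rcu)
{
	rcu->pending = 0;
}

/*
 * Run nsyscalls syscalls that each queue one callback. Simulated time never
 * advances, so the only "tick" is the optional kick at syscall entry.
 * Returns the peak backlog.
 */
size_t toy_run(size_t nsyscalls, int kick_per_syscall)
{
	struct toy_rcu rcu = { 0, 0 };
	size_t i;

	for (i = 0; i < nsyscalls; i++) {
		if (kick_per_syscall)
			toy_tick(&rcu);	/* analogous to the kick in the patch */
		toy_call_rcu(&rcu);
	}
	return rcu.peak;
}
```

In this model toy_run(1000, 0) returns 1000 (the backlog grows with the
amount of work done), while toy_run(1000, 1) returns 1 (the backlog stays
bounded).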

Comments

Richard Weinberger Sept. 12, 2024, 7:02 p.m. UTC | #1
On Fri, Aug 30, 2024 at 5:38 PM Benjamin Berg <benjamin@sipsolutions.net> wrote:
>
> From: Benjamin Berg <benjamin.berg@intel.com>
>
> In time-travel mode userspace can do a lot of work without any time
> passing. Unfortunately, this can result in OOM situations as the RCU
> core code will never be run.
>
> Work around that by kicking the RCU using rcu_sched_clock_irq. So
> behave to the RCU code as if a clock tick happened every syscall.
>
> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
>
> ---
>
> This patch is on top of "um: fix time-travel syscall scheduling hack"
> ---
>  arch/um/kernel/skas/syscall.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/um/kernel/skas/syscall.c b/arch/um/kernel/skas/syscall.c
> index b09e85279d2b..4b4ab8bf8a0c 100644
> --- a/arch/um/kernel/skas/syscall.c
> +++ b/arch/um/kernel/skas/syscall.c
> @@ -19,6 +19,21 @@ void handle_syscall(struct uml_pt_regs *r)
>         struct pt_regs *regs = container_of(r, struct pt_regs, regs);
>         int syscall;
>
> +       /*
> +        * This is a "bit" of a hack. But in time-travel mode userspace can do
> +        * a lot of work without any time passing. Unfortunately, this can
> +        * result in OOM situations as the RCU core code will never be run.
> +        *
> +        * Work around that by kicking the RCU using rcu_sched_clock_irq. So
> +        * behave to the RCU code as if a clock tick happened every syscall.
> +        */
> +       if (time_travel_mode == TT_MODE_INFCPU ||
> +           time_travel_mode == TT_MODE_EXTERNAL) {
> +               local_irq_disable();
> +               rcu_sched_clock_irq(1);
> +               local_irq_enable();
> +       }
> +

While I acknowledge that time-travel itself is a beautiful hack, I'd
like to keep the hacks needed to keep it working to a minimum.
So, the problem here is that RCU callbacks never run and just pile up?

I wonder why such a situation does not happen in a nohz_full setup on
regular systems.
Benjamin Berg Sept. 13, 2024, 10:50 a.m. UTC | #2
Hi,

On Thu, 2024-09-12 at 21:02 +0200, Richard Weinberger wrote:
> On Fri, Aug 30, 2024 at 5:38 PM Benjamin Berg
> <benjamin@sipsolutions.net> wrote:
> > 
> > From: Benjamin Berg <benjamin.berg@intel.com>
> > 
> > In time-travel mode userspace can do a lot of work without any time
> > passing. Unfortunately, this can result in OOM situations as the
> > RCU
> > core code will never be run.
> > 
> > Work around that by kicking the RCU using rcu_sched_clock_irq. So
> > behave to the RCU code as if a clock tick happened every syscall.
> > 
> > Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
> > 
> > [SNIP]
> 
> While I acknowledge that time-travel itself is a beautiful hack, I'd
> like to keep the hacks
> to keep it working minimal.
> So, the problem here is that RCU callbacks never run and just pile up?

Yes. A simple example of this is doing a "find /". This will allocate a
lot of inode information which is only freed at a later point.

> I wonder why such a situation does not happen in a nohz_full setup on
> regular systems.

I had to search for a bit, but I think the boot CPU will still have a
tick even in a NOHZ_FULL setup. See the nohz_full= boot parameter.

It does look like the RCU code might try to force scheduling (tiny RCU)
or wake up a worker (tree RCU) in these situations. But neither of
these attempts is going to fix the situation as there will be no call
to rcu_sched_clock_irq with time-travel.

Benjamin
Richard Weinberger Sept. 13, 2024, 11:47 a.m. UTC | #3
----- Original Message -----
> From: "Benjamin Berg" <benjamin@sipsolutions.net>
>> While I acknowledge that time-travel itself is a beautiful hack, I'd
>> like to keep the hacks
>> to keep it working minimal.
>> So, the problem here is that RCU callbacks never run and just pile up?
> 
> Yes. A simple example of this is doing a "find /". This will allocate a
> lot of inode information which is only freed at a later point.
> 
>> I wonder why such a situation does not happen in a nohz_full setup on
>> regular systems.
> 
> Had to search for a bit. But, I think the boot CPU will still have a
> tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.
> 
> It does look like the RCU code might try to force scheduling (tiny RCU)
> or wake up a worker (tree RCU) in these situations. But neither of
> these attempts is going to fix the situation as there will be no call
> to rcu_sched_clock_irq with time-travel.

Agreed. I think having a housekeeping CPU (thread) will not work in
time-travel mode.
Kicking RCU whenever a syscall is executed is okay. The question is:
are there other scenarios where RCU work can pile up while no syscall is
run for a long time? Maybe we need to kick it in other places too
(the page fault handler?).

Thanks,
//richard
Benjamin Berg Sept. 13, 2024, 12:04 p.m. UTC | #4
Hi

First, it doesn't seem like my patch actually works, so please do not
merge it. It actually appears that tree RCU and tiny RCU (which are
selected depending on the preemption setting) are behaving differently.

So now I am wondering if I can come up with a hack that works for both.

On Fri, 2024-09-13 at 13:47 +0200, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "Benjamin Berg" <benjamin@sipsolutions.net>
> > > While I acknowledge that time-travel itself is a beautiful hack, I'd
> > > like to keep the hacks
> > > to keep it working minimal.
> > > So, the problem here is that RCU callbacks never run and just pile up?
> > 
> > Yes. A simple example of this is doing a "find /". This will allocate a
> > lot of inode information which is only freed at a later point.
> > 
> > > I wonder why such a situation does not happen in a nohz_full setup on
> > > regular systems.
> > 
> > Had to search for a bit. But, I think the boot CPU will still have a
> > tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.
> > 
> > It does look like the RCU code might try to force scheduling (tiny RCU)
> > or wake up a worker (tree RCU) in these situations. But neither of
> > these attempts is going to fix the situation as there will be no call
> > to rcu_sched_clock_irq with time-travel.
> 
> Agreed. I think having a housekeeping CPU (thread) will not work in
> time-travel mode.
> Kicking RCU whenever a syscall is executed is okay, the question is,
> are there other scenarios where RCU work can pile up and no syscall is
> run for a long time? Maybe we need to kick it at other places (page fault handler?)
> too.

Hmm, that is a good question. I assume that implies major faults for
mapped files (or anonymous memory from swap) happening. I suppose that
can trigger just about anything in the kernel and could also create
load on RCU. Not sure how problematic that is; in our case it was
Python importing a large number of files and bringing the system to its
knees in the process.

Anyway, I'll need to reconsider the hack a bit, maybe we can find a
better solution.

Benjamin
Richard Weinberger Sept. 13, 2024, 12:32 p.m. UTC | #5
Hi!

----- Original Message -----
> From: "Benjamin Berg" <benjamin@sipsolutions.net>
> First, it doesn't seem like my patch actually works, so please do not
> merge it. It actually appears that tree RCU and tiny RCU (which are
> selected depending on the preemption setting) are behaving differently.
> 
> So now I am wondering if I can come up with a hack that works for both.

Ok!
 
> On Fri, 2024-09-13 at 13:47 +0200, Richard Weinberger wrote:
>> ----- Original Message -----
>> > From: "Benjamin Berg" <benjamin@sipsolutions.net>
>> > > While I acknowledge that time-travel itself is a beautiful hack, I'd
>> > > like to keep the hacks
>> > > to keep it working minimal.
>> > > So, the problem here is that RCU callbacks never run and just pile up?
>> > 
>> > Yes. A simple example of this is doing a "find /". This will allocate a
>> > lot of inode information which is only freed at a later point.
>> > 
>> > > I wonder why such a situation does not happen in a nohz_full setup on
>> > > regular systems.
>> > 
>> > Had to search for a bit. But, I think the boot CPU will still have a
>> > tick even on a NOHZ_FULL setup. see the nohz_full= boot parameter.
>> > 
>> > It does look like the RCU code might try to force scheduling (tiny RCU)
>> > or wake up a worker (tree RCU) in these situations. But neither of
>> > these attempts is going to fix the situation as there will be no call
>> > to rcu_sched_clock_irq with time-travel.
>> 
>> Agreed. I think having a housekeeping CPU (thread) will not work in
>> time-travel mode.
>> Kicking RCU whenever a syscall is executed is okay, the question is,
>> are there other scenarios where RCU work can pile up and no syscall is
>> run for a long time? Maybe we need to kick it at other places (page fault
>> handler?)
>> too.
> 
> Hmm, that is a good question. I assume that implies major faults for
> mapped files (or anonymous memory from swap) happening. I suppose, that
> can trigger just about anything in the kernel and could also create
> load on the RCU. Not sure how problematic that is, in our case it was
> python importing a large amount of files and bringing the system to its
> knees in the process.

I also had workloads in mind such as heavy network processing without
userspace interaction.
 
> Anyway, I'll need to reconsider the hack a bit, maybe we can find a
> better solution.

We can also bring the RCU folks into the loop. But I guess they will
first need a good introduction to what time-travel is. :-D

Thanks,
//richard
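
The open question in this thread (which entry points need a kick) can be
explored with another toy userspace model. Again this is only a sketch
under simplifying assumptions: every event queues exactly one callback, a
kick drains the whole backlog, and all toy_* names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical event kinds that may queue RCU work in the toy model. */
enum toy_event { TOY_SYSCALL, TOY_PAGE_FAULT, TOY_NET_RX };

/*
 * Replay a trace of n events. An event kicks (drains the backlog) only if
 * its bit is set in kick_mask; every event then queues one callback.
 * Returns the peak backlog seen during the trace.
 */
size_t toy_peak_backlog(const enum toy_event *trace, size_t n,
			unsigned int kick_mask)
{
	size_t pending = 0, peak = 0, i;

	for (i = 0; i < n; i++) {
		if (kick_mask & (1u << trace[i]))
			pending = 0;	/* kick at entry of this event kind */
		pending++;
		if (pending > peak)
			peak = pending;
	}
	return peak;
}
```

With a trace of four TOY_NET_RX events, one TOY_SYSCALL and three more
TOY_NET_RX events, a syscall-only mask (1u << TOY_SYSCALL) gives a peak of
4, no kick at all gives 8, and kicking on both syscalls and network
receive gives 1. In this model, a workload without syscalls is not helped
by a syscall-only kick.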

Patch

diff --git a/arch/um/kernel/skas/syscall.c b/arch/um/kernel/skas/syscall.c
index b09e85279d2b..4b4ab8bf8a0c 100644
--- a/arch/um/kernel/skas/syscall.c
+++ b/arch/um/kernel/skas/syscall.c
@@ -19,6 +19,21 @@  void handle_syscall(struct uml_pt_regs *r)
 	struct pt_regs *regs = container_of(r, struct pt_regs, regs);
 	int syscall;
 
+	/*
+	 * This is a "bit" of a hack. But in time-travel mode userspace can do
+	 * a lot of work without any time passing. Unfortunately, this can
+	 * result in OOM situations as the RCU core code will never be run.
+	 *
+	 * Work around that by kicking the RCU using rcu_sched_clock_irq. So
+	 * behave to the RCU code as if a clock tick happened every syscall.
+	 */
+	if (time_travel_mode == TT_MODE_INFCPU ||
+	    time_travel_mode == TT_MODE_EXTERNAL) {
+		local_irq_disable();
+		rcu_sched_clock_irq(1);
+		local_irq_enable();
+	}
+
 	/* Initialize the syscall number and default return value. */
 	UPT_SYSCALL_NR(r) = PT_SYSCALL_NR(r->gp);
 	PT_REGS_SET_SYSCALL_RETURN(regs, -ENOSYS);