diff mbox series

[v4,10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

Message ID 5acf7502c071c0d1365ba5e5940e003a7da6521f.camel@marvell.com
State Not Applicable
Delegated to: David Miller
Series "Task_isolation" mode

Commit Message

Alex Belits July 22, 2020, 2:57 p.m. UTC
From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds a check for it.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Frederic Weisbecker Oct. 1, 2020, 2:44 p.m. UTC | #1
On Wed, Jul 22, 2020 at 02:57:33PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
> 
> This patch adds check for it.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/time/tick-sched.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 6e4cd8459f05..2f82a6daf8fc 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/stat.h>
>  #include <linux/sched/nohz.h>
> +#include <linux/isolation.h>
>  #include <linux/module.h>
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();

What is it ordering?

> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;

You can't simply ignore an IPI. There is always a reason for a nohz_full CPU
to be kicked. Something triggered a tick dependency. It can be posix cpu timers
for example, or anything.


>  
>  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> -- 
> 2.26.2
>
Alex Belits Oct. 4, 2020, 3:22 p.m. UTC | #2
On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:57:33PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds check for it.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 6e4cd8459f05..2f82a6daf8fc 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/clock.h>
> >  #include <linux/sched/stat.h>
> >  #include <linux/sched/nohz.h>
> > +#include <linux/isolation.h>
> >  #include <linux/module.h>
> >  #include <linux/irq_work.h>
> >  #include <linux/posix-timers.h>
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	smp_rmb();
> 
> What is it ordering?

ll_isol_flags will be read in task_isolation_on_cpu(); that access
should be ordered against the writes in
task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
and task_isolation_start().

Since task_isolation_on_cpu() is often called for multiple CPUs in a
sequence, it would be wasteful to include a barrier inside it.
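The caller-side pattern being argued for, one barrier amortized over a
scan of many CPUs, can be sketched in a userspace model. The names
ll_isol_flags and task_isolation_on_cpu() are from the patch series;
modeling smp_rmb() with a C11 acquire fence, the array size, and the
scanning helper are my assumptions for illustration only:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Userspace stand-in for the per-cpu ll_isol_flags variable. */
static _Atomic unsigned long ll_isol_flags[NR_CPUS];

/* Barrier-free read, as in the patch: the caller issues the barrier. */
static bool task_isolation_on_cpu(int cpu)
{
	return atomic_load_explicit(&ll_isol_flags[cpu],
				    memory_order_relaxed) != 0;
}

/* One barrier for the whole scan, instead of one barrier per call
 * inside task_isolation_on_cpu() itself. */
static int count_isolated_cpus(void)
{
	int cpu, n = 0;

	atomic_thread_fence(memory_order_acquire);	/* models smp_rmb() */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (task_isolation_on_cpu(cpu))
			n++;
	return n;
}
```

A caller that only cares about a single CPU issues the same fence
immediately before the single task_isolation_on_cpu() call, which is
exactly what the smp_rmb() in tick_nohz_full_kick_cpu() does.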

> > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> >  		return;
> 
> You can't simply ignore an IPI. There is always a reason for a
> nohz_full CPU
> to be kicked. Something triggered a tick dependency. It can be posix
> cpu timers
> for example, or anything.

I realize that this is unusual, however the idea is that while the task
is running in isolated mode in userspace, we assume that from this
CPU's point of view whatever is happening in the kernel can wait until
the CPU is back in the kernel, and when it first enters the kernel from
this mode, it should "catch up" with everything that happened in its
absence. task_isolation_kernel_enter() is supposed to do that, so by
the time anything should be done involving the rest of the kernel, the
CPU is back to normal.

It is the application's responsibility to avoid triggering things that
break its isolation, so the application assumes that everything that
involves entering the kernel will be unavailable while it is isolated.
If isolation is broken, or the application requests a return from
isolation, everything goes back to the normal environment with all
functionality available.

> >  
> >  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> > -- 
> > 2.26.2
> >
Frederic Weisbecker Oct. 6, 2020, 9:41 p.m. UTC | #3
On Sun, Oct 04, 2020 at 03:22:09PM +0000, Alex Belits wrote:
> 
> On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > >   */
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > -	if (!tick_nohz_full_cpu(cpu))
> > > +	smp_rmb();
> > 
> > What is it ordering?
> 
> ll_isol_flags will be read in task_isolation_on_cpu(), that accrss
> should be ordered against writing in
> task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
> and task_isolation_start().
> 
> Since task_isolation_on_cpu() is often called for multiple CPUs in a
> sequence, it would be wasteful to include a barrier inside it.

Then I think you meant a full barrier: smp_mb()

> 
> > > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> > >  		return;
> > 
> > You can't simply ignore an IPI. There is always a reason for a
> > nohz_full CPU
> > to be kicked. Something triggered a tick dependency. It can be posix
> > cpu timers
> > for example, or anything.
> 
> I realize that this is unusual, however the idea is that while the task
> is running in isolated mode in userspace, we assume that from this CPUs
> point of view whatever is happening in kernel, can wait until CPU is
> back in kernel and when it first enters kernel from this mode, it
> should "catch up" with everything that happened in its absence.
> task_isolation_kernel_enter() is supposed to do that, so by the time
> anything should be done involving the rest of the kernel, CPU is back
> to normal.

You can't assume that. If something needs the tick, this can't wait.
If the user did something wrong, such as setting a posix cpu timer
to an isolated task, that's his fault and the kernel has to stick with
correctness and kick that task out of isolation mode.

> 
> It is application's responsibility to avoid triggering things that
> break its isolation

Precisely.

> so the application assumes that everything that
> involves entering kernel will not be available while it is isolated.

We can't do things that way and just ignore IPIs. You need to solve the
source of the noise, not the symptoms.

Thanks.
Alex Belits Oct. 17, 2020, 12:17 a.m. UTC | #4
On Tue, 2020-10-06 at 23:41 +0200, Frederic Weisbecker wrote:
> On Sun, Oct 04, 2020 at 03:22:09PM +0000, Alex Belits wrote:
> > On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> > > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > > >   */
> > > >  void tick_nohz_full_kick_cpu(int cpu)
> > > >  {
> > > > -	if (!tick_nohz_full_cpu(cpu))
> > > > +	smp_rmb();
> > > 
> > > What is it ordering?
> > 
> > ll_isol_flags will be read in task_isolation_on_cpu(), that accrss
> > should be ordered against writing in
> > task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
> > and task_isolation_start().
> > 
> > Since task_isolation_on_cpu() is often called for multiple CPUs in
> > a
> > sequence, it would be wasteful to include a barrier inside it.
> 
> Then I think you meant a full barrier: smp_mb()

For a read-only operation? task_isolation_on_cpu() is the only place
where the per-cpu ll_isol_flags is accessed, read-only, from multiple
CPUs. All other access to ll_isol_flags is done from the local CPU, and
writes are followed by smp_mb(). There are no other dependencies here,
except for operations that depend on the value returned from
task_isolation_on_cpu().

If/when more flags are added, these rules will still be followed,
because the intention is to store the state of isolation and the phases
of entering/breaking/reporting it, which can only be updated from the
local CPU.
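The locality invariant described above can be sketched as a userspace
model. The function names echo task_isolation_start() and
task_isolation_kernel_enter() from the patch series, but their bodies,
the flag bit, and the use of C11 fences in place of smp_mb()/smp_rmb()
are assumptions of this sketch, not the actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for this CPU's ll_isol_flags; only the "local CPU"
 * (the writer functions below) ever stores to it. */
static _Atomic unsigned long my_ll_isol_flags;

#define FLAG_LL_TASK_ISOLATION	0x1UL	/* assumed flag bit */

/* Models the local CPU entering isolation (task_isolation_start()):
 * a plain write, then a full barrier (smp_mb() in the kernel). */
static void local_isolation_enter(void)
{
	atomic_store_explicit(&my_ll_isol_flags, FLAG_LL_TASK_ISOLATION,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Models the flag being cleared on kernel entry
 * (task_isolation_kernel_enter()), same write-then-barrier rule. */
static void local_isolation_exit(void)
{
	atomic_store_explicit(&my_ll_isol_flags, 0,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Remote read side: a read barrier first, then a plain load,
 * matching the smp_rmb() in tick_nohz_full_kick_cpu(). */
static bool remote_sees_isolated(void)
{
	atomic_thread_fence(memory_order_acquire);	/* models smp_rmb() */
	return atomic_load_explicit(&my_ll_isol_flags,
				    memory_order_relaxed) != 0;
}
```

The design choice under discussion is visible here: because every write
is local and already followed by a full barrier, the remote readers only
need the cheap read-side fence, and that fence lives in the caller.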

> 
> > > > +	if (!tick_nohz_full_cpu(cpu) ||
> > > > task_isolation_on_cpu(cpu))
> > > >  		return;
> > > 
> > > You can't simply ignore an IPI. There is always a reason for a
> > > nohz_full CPU
> > > to be kicked. Something triggered a tick dependency. It can be
> > > posix
> > > cpu timers
> > > for example, or anything.

This was added some time ago, when timers appeared and CPUs were kicked
seemingly out of nowhere. At that point, breaking posix timers for
tasks that are not supposed to rely on posix timers was the least
problematic solution. From the user's point of view, entering isolation
in this case had an effect on the timer similar to the task exiting
while the timer is running.

Right now there are still sources of superfluous calls to this, when
tick_nohz_full_kick_all() is used. If I can confirm that this is the
only problematic place, I would rather fix its callers and make this
condition produce a warning.

This gives me the idea that if there is to be a mechanism specifically
for reporting kernel entry and isolation breaking, it should be
possible to distinguish between:

1. isolation breaking that has already happened upon kernel entry;
2. performing an operation that will immediately and synchronously
cause isolation breaking;
3. operations or conditions that will eventually or asynchronously
cause isolation breaking (having timers running, and possibly sending
signals, should be in the same category).

This case would be (2).
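The three-way classification proposed above could be sketched as
follows; the enum, the event names, and the classifier are entirely
hypothetical illustrations, not anything from the patch series:

```c
/* Hypothetical classification of isolation-breaking events. */
enum isol_break_class {
	ISOL_BREAK_HAPPENED,	/* (1) already broken on kernel entry   */
	ISOL_BREAK_SYNC,	/* (2) operation breaks isolation now   */
	ISOL_BREAK_ASYNC,	/* (3) will break isolation eventually  */
};

/* Illustrative events an isolated CPU might be exposed to. */
enum isol_event {
	EV_KERNEL_ENTRY,	/* syscall, page fault, trap taken      */
	EV_REMOTE_KICK,		/* e.g. tick_nohz_full_kick_cpu()       */
	EV_POSIX_TIMER_ARMED,	/* timer armed against an isolated task */
};

static enum isol_break_class classify(enum isol_event ev)
{
	switch (ev) {
	case EV_KERNEL_ENTRY:
		return ISOL_BREAK_HAPPENED;
	case EV_REMOTE_KICK:
		return ISOL_BREAK_SYNC;
	case EV_POSIX_TIMER_ARMED:
	default:
		return ISOL_BREAK_ASYNC;
	}
}
```

A reporting mechanism keyed on such a classification could then record
the category and source of each event rather than only the fact that
isolation ended.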

I assume that when the reporting of isolation breaking is separated
from the isolation implementation, it will be implemented as a runtime
error-condition reporting mechanism. It can then focus on providing
information about the category of events and their sources, with
internal logic designed for that purpose, as opposed to being designed
entirely for debugging, with the flexibility to obtain maximum detail
about the internals involved.

> > 
> > I realize that this is unusual, however the idea is that while the
> > task
> > is running in isolated mode in userspace, we assume that from this
> > CPUs
> > point of view whatever is happening in kernel, can wait until CPU
> > is
> > back in kernel and when it first enters kernel from this mode, it
> > should "catch up" with everything that happened in its absence.
> > task_isolation_kernel_enter() is supposed to do that, so by the
> > time
> > anything should be done involving the rest of the kernel, CPU is
> > back
> > to normal.
> 
> You can't assume that. If something needs the tick, this can't wait.
> If the user did something wrong, such as setting a posix cpu timer
> to an isolated task, that's his fault and the kernel has to stick
> with
> correctness and kick that task out of isolation mode.

That would be true if not for the multiple "let's just tell all other
CPUs that they should check if they have to update something"
situations like the one above.

In the case of timers it's possible that I will be able to eliminate
all specific instances where this is done, however I think that as a
general approach we have to establish some distinction between things
that must cause an IPI (and break isolation) and things that may be
delayed until the isolated userspace task allows it or some other
unavoidable isolation-breaking event happens.

> 
> > It is application's responsibility to avoid triggering things that
> > break its isolation
> 
> Precisely.

Right. However, there are things like tick_nohz_full_kick_all() and
similar procedures that result in mass-sending of IPIs without
determining whether the target CPUs have anything to do with the event
at all, let alone have to handle it right now; that does not give me
the impression that we can blame the application for it. I realize that
this is done for a reason, with the assumption that sending IPIs is
"cheaper" and does not require complex synchronization compared to
determining what should be notified and when, however this is not
compatible with the goals of task isolation.

> 
> > so the application assumes that everything that
> > involves entering kernel will not be available while it is
> > isolated.
> 
> We can't do things that way and just ignore IPIs. You need to solve
> the
> source of the noise, not the symptoms.

It may be that eventually we can completely eliminate those things (at
least when isolation is enabled and this is relevant), however for the
purpose of having usable code without massive changes to numerous
callers, in my opinion we should acknowledge that some things should be
disabled while the task is isolated and called on isolation exit --
either unconditionally, or conditionally if they were requested while
the task was isolated.

I believe that as long as we create a distinction between "must break
isolation", "delayed until the end of isolation" and "can be safely
ignored while the task is isolated" IPIs, we will end up with less
intrusive changes and reliably working functionality.

Then, if we are able to eliminate the sources of things in the last
two categories, we can treat them as if they were in the first one.
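The three IPI categories described above could take a shape like the
following sketch. Every name here (the enum, the state struct, the
policy helper) is hypothetical, invented for illustration under the
assumption that each cross-CPU notification is tagged with its
category at the call site:

```c
#include <stdbool.h>

/* Hypothetical per-notification policy tags. */
enum ipi_policy {
	IPI_MUST_BREAK,		/* deliver now, breaking isolation        */
	IPI_DEFERRABLE,		/* record it; replay on next kernel entry */
	IPI_IGNORABLE,		/* safe to drop while the task is isolated */
};

/* Hypothetical per-CPU bookkeeping for the sketch. */
struct cpu_isol_state {
	bool isolated;
	bool deferred_pending;	/* at least one deferred IPI queued */
};

/* Returns true if the IPI should actually be sent to the target CPU. */
static bool ipi_should_send(struct cpu_isol_state *st, enum ipi_policy p)
{
	if (!st->isolated)
		return true;		/* normal CPU: always deliver */

	switch (p) {
	case IPI_MUST_BREAK:
		st->isolated = false;	/* delivery breaks isolation */
		return true;
	case IPI_DEFERRABLE:
		st->deferred_pending = true;	/* replay later */
		return false;
	case IPI_IGNORABLE:
	default:
		return false;
	}
}
```

Under this scheme, tick_nohz_full_kick_cpu() would tag its kick as
deferrable or ignorable depending on the caller, instead of the
all-or-nothing check in the patch.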

It may be that the timers are already ready for this, and I should just
check what causes the tick_nohz_full_kick_all() calls. If so, this
particular check won't be necessary, because all calls will happen for
a good reason in situations controlled by the application. However, as
a general approach, I think we need this longer way, with decisions
about delaying or ignoring events.

Patch

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6e4cd8459f05..2f82a6daf8fc 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@ 
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -268,7 +269,8 @@  static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	smp_rmb();
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);