Message ID | alpine.LFD.2.01.0907081701050.3352@localhost.localdomain |
---|---|
State | Not Applicable, archived |
Delegated to: | David Miller |
Headers | show |
From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, 8 Jul 2009 17:01:14 -0700 (PDT) > So ACK on Anton's patch, I'll integrate it into my tree, thanks. > but I wonder if we _also_ should do the following? I think we should set SYSTEM_RUNNING earlier, makes sense to me. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote: > Looking at the people looking at SYSTEM_RUNNING, I do note some odd cases. > Why the heck does kernel/perf_counter.c do it, for example? Ah, those are the swcounter and other probe entry points. I've had several cases where we called into the perf counter code from those points before it was initialized, getting in kernel segfaults due to dereferencing uninitialized data etc.. I could keep a variable that tracked the perf_counter_init() state, and use that instead if you prefer? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2009-07-09 at 09:56 +0200, Peter Zijlstra wrote: > On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote: > > Looking at the people looking at SYSTEM_RUNNING, I do note some odd cases. > > Why the heck does kernel/perf_counter.c do it, for example? > > Ah, those are the swcounter and other probe entry points. I've had > several cases where we called into the perf counter code from those > points before it was initialized, getting in kernel segfaults due to > dereferencing uninitialized data etc.. > > I could keep a variable that tracked the perf_counter_init() state, and > use that instead if you prefer? Looks like that'd be more accurate. Linus's proposed patch might break your current assumptions.
On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote: > > On Thu, 9 Jul 2009, Anton Vorontsov wrote: > > > > The netpoll code is using msleep() just a few lines below cond_resched(), > > so we won't make things worse. ;-) > > Yeah. That function is definitely sleeping. It does things like > kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an > added msleep() is the least of our problems. > > Afaik, it's called from a bog-standard "module_init()", which happens late > enough that everything works. > > In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_ > doing the whole "do_initcalls()". Well there are two ways of consistently defining SYSTEM_RUNNING: a) define it with reference to the well-understood notion of booting vs running and don't switch it until handing off to init b) define it with reference to its usage by an arbitrary user like cond_resched() In the latter case, we obviously need to move it to the earliest point that scheduling is possible. But there are a number of things like http://lxr.linux.no/linux+v2.6.30/kernel/printk.c#L228 that assume the definition is actually (a). We're currently within a couple lines of a strict definition of (a) already, so I actually think cond_resched() is just wrong (and we already know it broke a previously-working user). It should perhaps be using another private flag that gets set as soon as scheduling is up and running. But I'd actually go further and say that it's unfortunate to be checking extra flags in such an important inline, especially since the check is false for all but the first couple seconds of run time. Seems like we could avoid adding an extra check by artificially elevating the preempt count in early boot (or at compile time) then dropping it when scheduling becomes available.
On Thu, 2009-07-09 at 08:26 -0500, Matt Mackall wrote: > On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote: > > > > On Thu, 9 Jul 2009, Anton Vorontsov wrote: > > > > > > The netpoll code is using msleep() just a few lines below cond_resched(), > > > so we won't make things worse. ;-) > > > > Yeah. That function is definitely sleeping. It does things like > > kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an > > added msleep() is the least of our problems. > > > > Afaik, it's called from a bog-standard "module_init()", which happens late > > enough that everything works. > > > > In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_ > > doing the whole "do_initcalls()". > > Well there are two ways of consistently defining SYSTEM_RUNNING: > > a) define it with reference to the well-understood notion of booting vs > running and don't switch it until handing off to init This makes the most sense IMHO. > b) define it with reference to its usage by an arbitrary user like > cond_resched() > > In the latter case, we obviously need to move it to the earliest point > that scheduling is possible. But there are a number of things like > > http://lxr.linux.no/linux+v2.6.30/kernel/printk.c#L228 > > that assume the definition is actually (a). We're currently within a > couple lines of a strict definition of (a) already, so I actually think > cond_resched() is just wrong (and we already know it broke a > previously-working user). It should perhaps be using another private > flag that gets set as soon as scheduling is up and running. Right as mentioned before in this thread, we grew scheduler_running a while back which could be used for this. > But I'd actually go further and say that it's unfortunate to be checking > extra flags in such an important inline, especially since the check is > false for all but the first couple seconds of run time. Seems like we > could avoid adding an extra check by artificially elevating the preempt > count in early boot (or at compile time) then dropping it when > scheduling becomes available. Calling cond_resched() and co when !preemptable is an error so this wouldn't actually work. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2009-07-09 at 15:46 +0200, Peter Zijlstra wrote: > On Thu, 2009-07-09 at 08:26 -0500, Matt Mackall wrote: > > On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote: > > > > > > On Thu, 9 Jul 2009, Anton Vorontsov wrote: > > > > > > > > The netpoll code is using msleep() just a few lines below cond_resched(), > > > > so we won't make things worse. ;-) > > > > > > Yeah. That function is definitely sleeping. It does things like > > > kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an > > > added msleep() is the least of our problems. > > > > > > Afaik, it's called from a bog-standard "module_init()", which happens late > > > enough that everything works. > > > > > > In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_ > > > doing the whole "do_initcalls()". > > > > Well there are two ways of consistently defining SYSTEM_RUNNING: > > > > a) define it with reference to the well-understood notion of booting vs > > running and don't switch it until handing off to init > > This makes the most sense IMHO. > > > b) define it with reference to its usage by an arbitrary user like > > cond_resched() > > > > In the latter case, we obviously need to move it to the earliest point > > that scheduling is possible. But there are a number of things like > > > > http://lxr.linux.no/linux+v2.6.30/kernel/printk.c#L228 > > > > that assume the definition is actually (a). We're currently within a > > couple lines of a strict definition of (a) already, so I actually think > > cond_resched() is just wrong (and we already know it broke a > > previously-working user). It should perhaps be using another private > > flag that gets set as soon as scheduling is up and running. > > Right as mentioned before in this thread, we grew scheduler_running a > while back which could be used for this. > > > But I'd actually go further and say that it's unfortunate to be checking > > extra flags in such an important inline, especially since the check is > > false for all but the first couple seconds of run time. Seems like we > > could avoid adding an extra check by artificially elevating the preempt > > count in early boot (or at compile time) then dropping it when > > scheduling becomes available. > > Calling cond_resched() and co when !preemptable is an error so this > wouldn't actually work. Sorry if I was unclear. I'm suggesting setting the count so the existing PREEMPT_ACTIVE test here fires: int __sched _cond_resched(void) { if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) && system_state == SYSTEM_RUNNING) { __cond_resched(); return 1; } return 0; } That should be kosher.
diff --git a/init/main.c b/init/main.c index 2c5ade7..f10d9cd 100644 --- a/init/main.c +++ b/init/main.c @@ -788,6 +788,7 @@ static void __init do_initcalls(void) { initcall_t *call; + system_state = SYSTEM_RUNNING; for (call = __early_initcall_end; call < __initcall_end; call++) do_one_initcall(*call); @@ -839,7 +840,6 @@ static noinline int init_post(void) free_initmem(); unlock_kernel(); mark_rodata_ro(); - system_state = SYSTEM_RUNNING; numa_default_policy(); if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)