Message ID: 1271200751-18697-1-git-send-email-chase.douglas@canonical.com
State: Superseded
Delegated to: Stefan Bader
On Mon, 2010-04-19 at 20:52 +0200, Peter Zijlstra wrote:
> So the only early updates can come from
> pick_next_task_idle()->calc_load_account_active(), so why not specialize
> that callchain instead of the below?

To clarify, when I wrote that your patch was still below.. ;-)
On Mon, Apr 19, 2010 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-04-13 at 16:19 -0700, Chase Douglas wrote:
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers calc_load_tasks accounting after each cpu updates the
>> count until after the 10 tick update window.
>>
>> A few points:
>>
>> * A global atomic deferral counter, and not per-cpu vars, is needed
>>   because a cpu may go NOHZ idle and not be able to update the global
>>   calc_load_tasks variable for subsequent load calculations.
>> * It is not enough to add calls to account for the load when a cpu is
>>   awakened:
>>   - Load avg calculation must be independent of cpu load.
>>   - If a cpu is awakened by one task, but then has more scheduled before
>>     the end of the update window, only the first task will be accounted.
>
> OK, so what you're saying is that because we update calc_load_tasks from
> entering idle, we decrease earlier than a regular 10 tick sample
> interval would?
>
> Hence you batch these early updates into _deferred and let the next 10
> tick sample roll them over?

Correct

> So the only early updates can come from
> pick_next_task_idle()->calc_load_account_active(), so why not specialize
> that callchain instead of the below?
>
> Also, since its all NO_HZ, why not stick this in with the ILB? Once
> people get around to making that scale better, this can hitch a ride.
>
> Something like the below perhaps?
> It does run partially from softirq
> context, but since there's a distinct lack of synchronization here that
> didn't seem like an immediate problem.

I understand everything until you move the calc_load_account_active call to
run_rebalance_domains. I take it that when CPUs go NO_HZ idle, at least one
cpu is left to monitor and perform updates as necessary. Conceptually, it
makes sense that this cpu should be handling the load accounting updates.

However, I'm new to this code, so I'm having a hard time understanding all
the cases and timings for when the scheduler softirq is called. Is it
guaranteed to be called during every 10 tick load update window? If not,
then we'll have the issue where a NO_HZ idle cpu won't be updated to 0
running tasks in time for the load avg calculation. Would someone be able
to explain how the correct timing is guaranteed for this path?

I also have a concern with run_rebalance_domains: if the designated
no_hz.load_balancer cpu wasn't idle at the last tick or needs rescheduling,
load accounting won't occur for idle cpus. Is it possible for this to
happen every time it's called within the 10 tick update window?

-- 
Chase
diff --git a/kernel/sched.c b/kernel/sched.c
index abb36b1..be348cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3010,6 +3010,7 @@ unsigned long this_cpu_load(void)
 
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
@@ -3064,7 +3065,7 @@ void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	long nr_active, delta, deferred;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3072,6 +3073,25 @@ static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/*
+		 * Update calc_load_tasks only once per cpu in 10 tick update
+		 * window.
+		 */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			if (delta)
+				atomic_long_add(delta,
+						&calc_load_tasks_deferred);
+			return;
+		}
+
+		if (atomic_long_read(&calc_load_tasks_deferred)) {
+			deferred = atomic_long_xchg(&calc_load_tasks_deferred,
+						    0);
+			delta += deferred;
+		}
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3106,8 +3126,8 @@ static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }