Message ID | 20100409062118.D4096CBB6C@localhost.localdomain (mailing list archive) |
---|---|
State | Not Applicable |
On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> When calculating capacity we use the following calculation:
>
>     capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
>
> In SMT2, power will be 1178/2 (provided we are not scaling power with
> freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> being 1.
>
> With SMT4 however, power will be 1178/4, hence capacity will end up as
> 0.
>
> Fix this by ensuring capacity is always at least 1 after this
> calculation.
>
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> ---
> I'm not sure this is the correct fix but this works for me.

Right, so I suspect this will indeed break some things.

We initially allowed 0 capacity for when a cpu is consumed by an RT task
and there simply isn't much capacity left; in that case you really want
to try and move load to your sibling cpus if possible.

However you're right that this goes awry in your case.

One thing to look at is if that 15% increase is indeed representative
for the power7 cpu; it having 4 SMT threads seems to suggest there were
significant gains, otherwise they'd not have wasted the silicon.

(The broken x86 code was meant to actually compute the SMT gain, so that
we'd not have to guess the 15%.)

Now, increasing this will only marginally fix the issue, since if you
end up with 512 per thread it only takes a very tiny amount of RT
workload to drop below and end up at 0 again.

One thing we could look at is using the cpu base power to compute
capacity from. We'd have to add another field to sched_group and store
power before we do the scale_rt_power() stuff.

Thoughts?
>  kernel/sched_fair.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
>  		}
>
>  		capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> +		capacity = max(capacity, 1UL);
>
>  		if (tmp->flags & SD_POWERSAVINGS_BALANCE)
>  			nr_running /= 2;
> @@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
>
>  	sgs->group_capacity =
>  		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
> +	sgs->group_capacity = max(sgs->group_capacity, 1UL);
>  }
>
>  /**
> @@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
>
>  	for_each_cpu(i, sched_group_cpus(group)) {
>  		unsigned long power = power_of(i);
> -		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> +		unsigned long capacity;
>  		unsigned long wl;
>
> +		capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
> +
>  		if (!cpumask_test_cpu(i, cpus))
>  			continue;
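The rounding behaviour the patch description talks about can be reproduced in a stand-alone user-space program; a minimal sketch (DIV_ROUND_CLOSEST is re-implemented here for positive values only, and 1178 is the per-core cpu_power figure from the mail):

```c
#include <assert.h>

/* Positive-only re-implementation of the kernel's DIV_ROUND_CLOSEST */
#define DIV_ROUND_CLOSEST(x, divisor)	(((x) + ((divisor) / 2)) / (divisor))

#define SCHED_LOAD_SCALE	1024UL

/* capacity as computed in update_sg_lb_stats() before the fix */
static unsigned long capacity(unsigned long power)
{
	return DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
}
```

With a core power of 1178, SMT2 gives capacity(589) == 1 while SMT4 gives capacity(294) == 0; the max(capacity, 1UL) hunks above guard against exactly that rounding to zero.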
In message <1271161766.4807.1280.camel@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > When calculating capacity we use the following calculation:
> >
> >     capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> >
> > In SMT2, power will be 1178/2 (provided we are not scaling power with
> > freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> > being 1.
> >
> > With SMT4 however, power will be 1178/4, hence capacity will end up as
> > 0.
> >
> > Fix this by ensuring capacity is always at least 1 after this
> > calculation.
> >
> > Signed-off-by: Michael Neuling <mikey@neuling.org>
> > ---
> > I'm not sure this is the correct fix but this works for me.
>
> Right, so I suspect this will indeed break some things.
>
> We initially allowed 0 capacity for when a cpu is consumed by an RT task
> and there simply isn't much capacity left, in that case you really want
> to try and move load to your sibling cpus if possible.

Changing the CPU power based on what tasks are running on them seems a
bit wrong to me.  Shouldn't we keep those concepts separate?

> However you're right that this goes awry in your case.
>
> One thing to look at is if that 15% increase is indeed representative
> for the power7 cpu, it having 4 SMT threads seems to suggest there was
> significant gains, otherwise they'd not have wasted the silicon.

There are certainly, for most workloads, per core gains for SMT4 over
SMT2 on P7.  My kernels certainly compile faster and that's the only
workload anyone who matters cares about.... ;-)

> (The broken x86 code was meant to actually compute the SMT gain, so that
> we'd not have to guess the 15%)
>
> Now, increasing this will only marginally fix the issue, since if you
> end up with 512 per thread it only takes a very tiny amount of RT
> workload to drop below and end up at 0 again.
I tried initially to make smt_gain programmable, and at 2048 per core
(512 per thread) the packing became unstable, as you intimated.

> One thing we could look at is using the cpu base power to compute
> capacity from. We'd have to add another field to sched_group and store
> power before we do the scale_rt_power() stuff.

Separating capacity from what RT tasks are running seems like a good
idea to me.

This would fix the RT issue, but it's not clear to me how you are
suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
the RT tasks out from capacity and keep the max(1, capacity) that I've
added?  Or something else?

Would another possibility be changing capacity to a scaled value (like
cpu_power is now) rather than the small integer it is now?  For
example, a scaled capacity of 1024 would be equivalent to a capacity of
1 now.  This might enable us to handle partial capacities better?  We'd
probably have to scale a bunch of nr_running too.

Mikey
On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
> > Right, so I suspect this will indeed break some things.
> >
> > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > and there simply isn't much capacity left, in that case you really want
> > to try and move load to your sibling cpus if possible.
>
> Changing the CPU power based on what tasks are running on them seems a
> bit wrong to me.  Shouldn't we keep those concepts separate?

Well, the thing cpu_power represents is a ratio of compute capacity
available to this cpu as compared to other cpus. By normalizing the
runqueue weights with this we end up with a fair balance.

The thing to realize here is that this is solely about SCHED_NORMAL
tasks; SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
fairness and available compute capacity.

So if we were to ignore RT tasks, you'd end up with a situation where,
assuming 2 cpus, 4 equally weighted NORMAL tasks, and 1 RT task, the
load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
end up on the cpu the RT task is running on would not run as fast --
is that fair?

Since RT tasks do not have a weight (FIFO/RR have no limit at all;
DEADLINE would have something equivalent to a max weight), it is
impossible to account them in the normal weight sense.

Therefore the current model takes them into account by lowering the
compute capacity according to their (avg) cpu usage. So if the RT task
consumes 66% cputime, we'd end up with a situation where the cpu
running the RT task gets 1 NORMAL task and the other cpu gets the
remaining 3; that way they'd all get 33% cpu.

> > However you're right that this goes awry in your case.
> >
> > One thing to look at is if that 15% increase is indeed representative
> > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > significant gains, otherwise they'd not have wasted the silicon.
>
> There are certainly, for most workloads, per core gains for SMT4 over
> SMT2 on P7.  My kernels certainly compile faster and that's the only
> workload anyone who matters cares about.... ;-)

For sure ;-)

Are there any numbers available on how much they gain? It might be
worth sticking in real numbers instead of this alleged 15%.

> > One thing we could look at is using the cpu base power to compute
> > capacity from. We'd have to add another field to sched_group and store
> > power before we do the scale_rt_power() stuff.
>
> Separating capacity from what RT tasks are running seems like a good
> idea to me.

Well, per the above we cannot fully separate them.

> This would fix the RT issue, but it's not clear to me how you are
> suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
> we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
> the RT tasks out from capacity and keep the max(1, capacity) that I've
> added?  Or something else?

I would think that 4 SMT threads are still slower than two full cores,
right? So cpu_power=2048 would not be appropriate.

> Would another possibility be changing capacity to a scaled value (like
> cpu_power is now) rather than a small integer as it is now?  For
> example, a scaled capacity of 1024 would be equivalent to a capacity of
> 1 now.  This might enable us to handle partial capacities better?  We'd
> probably have to scale a bunch of nr_running too.

Right, so my proposal was to scale down the capacity divider (currently
1024) to whatever would be the base capacity for that cpu. Trouble seems
to be that that makes group capacity a lot more complex, as you would
end up needing to average the base capacities of all the cpus.

Hrmm, my brain seems muddled but I might have another solution, let me
ponder this for a bit..
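Peter's 66% example reduces to integer arithmetic. A hedged sketch of the idea behind scale_rt_power() (the helper names and the percentage-based model are invented for illustration; this is not the kernel's actual fixed-point code):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE	1024UL

/* Scale a cpu's power down by the fraction of time (percent) RT tasks use */
static unsigned long rt_scaled_power(unsigned long base, unsigned long rt_pct)
{
	return base * (100 - rt_pct) / 100;
}

/*
 * Share nr_tasks NORMAL tasks between two cpus in proportion to their
 * RT-scaled power -- the split the load balancer converges towards.
 */
static unsigned long tasks_for_cpu(unsigned long nr_tasks,
				   unsigned long power, unsigned long other)
{
	return nr_tasks * power / (power + other);
}
```

An RT task eating 66% leaves rt_scaled_power(1024, 66) == 348, and tasks_for_cpu(4, 348, 1024) == 1: the cpu running the RT task keeps one NORMAL task while the other cpu takes the remaining three, so all four get roughly a third of a cpu each, as described in the mail.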
In message <1271426308.1674.429.camel@laptop> you wrote:
> On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
> > > Right, so I suspect this will indeed break some things.
> > >
> > > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > > and there simply isn't much capacity left, in that case you really want
> > > to try and move load to your sibling cpus if possible.
> >
> > Changing the CPU power based on what tasks are running on them seems a
> > bit wrong to me.  Shouldn't we keep those concepts separate?
>
> Well the thing cpu_power represents is a ratio of compute capacity
> available to this cpu as compared to other cpus. By normalizing the
> runqueue weights with this we end up with a fair balance.
>
> The thing to realize here is that this is solely about SCHED_NORMAL
> tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
> fairness and available compute capacity.
>
> So if we were to ignore RT tasks, you'd end up with a situation where,
> assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
> load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
> would end up on the cpu the RT tasks would be running on would not run
> as fast -- is that fair?
>
> Since RT tasks do not have a weight (FIFO/RR have no limit at all,
> DEADLINE would have something equivalent to a max weight), it is
> impossible to account them in the normal weight sense.
>
> Therefore the current model takes them into account by lowering the
> compute capacity according to their (avg) cpu usage. So if the RT task
> would consume 66% cputime, we'd end up with a situation where the cpu
> running the RT task would get 1 NORMAL task, and other cpu would have
> the remaining 3, that way they'd all get 33% cpu.

Thanks for the explanation.  Your last example makes perfect sense to me.

> > > However you're right that this goes awry in your case.
> > >
> > > One thing to look at is if that 15% increase is indeed representative
> > > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > > significant gains, otherwise they'd not have wasted the silicon.
> >
> > There are certainly, for most workloads, per core gains for SMT4 over
> > SMT2 on P7.  My kernels certainly compile faster and that's the only
> > workload anyone who matters cares about.... ;-)
>
> For sure ;-)
>
> Are there any numbers available on how much they gain? It might be worth
> to stick in real numbers instead of this alleged 15%.

I can get some gain numbers, but obviously the workload makes a huge
difference.  From a scheduler perspective, I assume an
average/representative gain is best rather than an optimistic or
pessimistic one?

We'll have different gains for SMT2 and SMT4, so we could change the
gain dynamically based on which SMT mode we are in.  Does that seem like
something we should add as an arch hook?

> > > One thing we could look at is using the cpu base power to compute
> > > capacity from. We'd have to add another field to sched_group and store
> > > power before we do the scale_rt_power() stuff.
> >
> > Separating capacity from what RT tasks are running seems like a good
> > idea to me.
>
> Well, per the above we cannot fully separate them.

Yep.

> > This would fix the RT issue, but it's not clear to me how you are
> > suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
> > we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
> > the RT tasks out from capacity and keep the max(1, capacity) that I've
> > added?  Or something else?
>
> I would think that 4 SMT threads are still slower than two full cores,
> right? So cpu_power=2048 would not be appropriate.

Yes, for an average workload, 4 SMT threads are slower than 2 full
cores.

> > Would another possibility be changing capacity to a scaled value (like
> > cpu_power is now) rather than a small integer as it is now?  For
> > example, a scaled capacity of 1024 would be equivalent to a capacity of
> > 1 now.  This might enable us to handle partial capacities better?  We'd
> > probably have to scale a bunch of nr_running too.
>
> Right, so my proposal was to scale down the capacity divider (currently
> 1024) to whatever would be the base capacity for that cpu. Trouble seems
> to be that that makes group capacity a lot more complex, as you would
> end up needing to average the base capacities of all the cpus.
>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Great!

Mikey
On Mon, 2010-04-19 at 07:34 +1000, Michael Neuling wrote:
> > Are there any numbers available on how much they gain? It might be
> > worth sticking in real numbers instead of this alleged 15%.
>
> I can get some gain numbers, but obviously the workload makes a huge
> difference.  From a scheduler perspective, I assume an
> average/representative gain is best rather than an optimistic or
> pessimistic one?

Yeah, average would be best.

> We'll have different gains for SMT2 and SMT4, so we could change the
> gain dynamically based on which SMT mode we are in.  Does that seem like
> something we should add as an arch hook?

That's the sort of thing you can use arch_scale_smt_power() for. But be
wary of falling into the same trap I did with x86, where I confused
actual gain with capacity (when idle the actual gain is 0, but the
capacity is not).
In message <1271688543.1488.253.camel@laptop> you wrote:
> On Mon, 2010-04-19 at 07:34 +1000, Michael Neuling wrote:
> > > Are there any numbers available on how much they gain? It might be
> > > worth sticking in real numbers instead of this alleged 15%.
> >
> > I can get some gain numbers, but obviously the workload makes a huge
> > difference.  From a scheduler perspective, I assume an
> > average/representative gain is best rather than an optimistic or
> > pessimistic one?
>
> Yeah, average would be best.

Ok.

> > We'll have different gains for SMT2 and SMT4, so we could change the
> > gain dynamically based on which SMT mode we are in.  Does that seem like
> > something we should add as an arch hook?
>
> That's the sort of thing you can use arch_scale_smt_power() for. But be
> wary of falling into the same trap I did with x86, where I confused
> actual gain with capacity (when idle the actual gain is 0, but the
> capacity is not).

Oops, yes of course :-)

<from before>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Let me know if/when you come up with this solution, or if I can help.

Mikey
In message <1271426308.1674.429.camel@laptop> you wrote:
> On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
> > > Right, so I suspect this will indeed break some things.
> > >
> > > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > > and there simply isn't much capacity left, in that case you really want
> > > to try and move load to your sibling cpus if possible.
> >
> > Changing the CPU power based on what tasks are running on them seems a
> > bit wrong to me.  Shouldn't we keep those concepts separate?
>
> Well the thing cpu_power represents is a ratio of compute capacity
> available to this cpu as compared to other cpus. By normalizing the
> runqueue weights with this we end up with a fair balance.
>
> The thing to realize here is that this is solely about SCHED_NORMAL
> tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
> fairness and available compute capacity.
>
> So if we were to ignore RT tasks, you'd end up with a situation where,
> assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
> load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
> would end up on the cpu the RT tasks would be running on would not run
> as fast -- is that fair?
>
> Since RT tasks do not have a weight (FIFO/RR have no limit at all,
> DEADLINE would have something equivalent to a max weight), it is
> impossible to account them in the normal weight sense.
>
> Therefore the current model takes them into account by lowering the
> compute capacity according to their (avg) cpu usage. So if the RT task
> would consume 66% cputime, we'd end up with a situation where the cpu
> running the RT task would get 1 NORMAL task, and other cpu would have
> the remaining 3, that way they'd all get 33% cpu.
>
> > > However you're right that this goes awry in your case.
> > >
> > > One thing to look at is if that 15% increase is indeed representative
> > > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > > significant gains, otherwise they'd not have wasted the silicon.
> >
> > There are certainly, for most workloads, per core gains for SMT4 over
> > SMT2 on P7.  My kernels certainly compile faster and that's the only
> > workload anyone who matters cares about.... ;-)
>
> For sure ;-)
>
> Are there any numbers available on how much they gain? It might be worth
> to stick in real numbers instead of this alleged 15%.
>
> > > One thing we could look at is using the cpu base power to compute
> > > capacity from. We'd have to add another field to sched_group and store
> > > power before we do the scale_rt_power() stuff.
> >
> > Separating capacity from what RT tasks are running seems like a good
> > idea to me.
>
> Well, per the above we cannot fully separate them.
>
> > This would fix the RT issue, but it's not clear to me how you are
> > suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
> > we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
> > the RT tasks out from capacity and keep the max(1, capacity) that I've
> > added?  Or something else?
>
> I would think that 4 SMT threads are still slower than two full cores,
> right? So cpu_power=2048 would not be appropriate.
>
> > Would another possibility be changing capacity to a scaled value (like
> > cpu_power is now) rather than a small integer as it is now?  For
> > example, a scaled capacity of 1024 would be equivalent to a capacity of
> > 1 now.  This might enable us to handle partial capacities better?  We'd
> > probably have to scale a bunch of nr_running too.
>
> Right, so my proposal was to scale down the capacity divider (currently
> 1024) to whatever would be the base capacity for that cpu. Trouble seems
> to be that that makes group capacity a lot more complex, as you would
> end up needing to average the base capacities of all the cpus.
>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Peter,

Did you manage to get anywhere on this capacity issue?

Mikey
On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Right, so the thing I was thinking about is taking the group capacity
into account when determining the capacity for a single cpu.

Say the group contains all the SMT siblings, then use the group capacity
(usually larger than 1024) and then distribute the capacity over the
group members, preferring CPUs with higher individual cpu_power over
those with less.

So suppose you've got 4 siblings with cpu_power=294 each, then we assign
capacity 1 to the first member, and the remaining 153 is insufficient,
and thus we stop and the rest lives with 0 capacity.

Now take the example that the first sibling would be running a heavy RT
load, and its cpu_power would be reduced to say, 50, then we still got
nearly 933 left over the others, which is still sufficient for one
capacity, but because the first sibling is low, we'll assign it 0 and
instead assign 1 to the second, again, leaving the third and fourth 0.

If the group were a core group, the total would be much higher and we'd
likely end up assigning 1 to each before we'd run out of capacity.

For power savings, we can lower the threshold and maybe use the maximal
individual cpu_power in the group to base 1 capacity from.

So, suppose the second example, where sibling0 has 50 and the others
have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
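The distribution scheme above can be prototyped in user space. A hedged sketch (the function name, the requirement that powers arrive pre-sorted high-to-low, and the round-to-closest group total are assumptions filling in details the mail leaves open):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE	1024UL

/*
 * Sum the (already high-to-low sorted) member powers into a group
 * capacity, rounding to closest as the kernel does, then hand out one
 * unit of capacity per member until it runs out.  Members with higher
 * cpu_power are preferred simply because they come first.
 */
static void distribute_capacity(const unsigned long *power,
				unsigned long *cap, int n)
{
	unsigned long total = 0;
	unsigned long grp;
	int i;

	for (i = 0; i < n; i++)
		total += power[i];

	grp = (total + SCHED_LOAD_SCALE / 2) / SCHED_LOAD_SCALE;

	for (i = 0; i < n; i++) {
		cap[i] = grp ? 1 : 0;
		if (grp)
			grp--;
	}
}
```

Four siblings at cpu_power=294 sum to 1176, one unit of capacity, so the first sibling gets 1 and the rest 0. With one sibling dragged down to 50 by RT load, the sorted powers {294, 294, 294, 50} still sum to roughly one unit, which goes to a full-power sibling while the RT-loaded one ends up with 0, matching the two worked examples above.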
* Peter Zijlstra <peterz@infradead.org> [2010-05-31 10:33:16]:

> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
>
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.
>
> Say the group contains all the SMT siblings, then use the group capacity
> (usually larger than 1024) and then distribute the capacity over the
> group members, preferring CPUs with higher individual cpu_power over
> those with less.
>
> So suppose you've got 4 siblings with cpu_power=294 each, then we assign
> capacity 1 to the first member, and the remaining 153 is insufficient,
> and thus we stop and the rest lives with 0 capacity.
>
> Now take the example that the first sibling would be running a heavy RT
> load, and its cpu_power would be reduced to say, 50, then we still got
> nearly 933 left over the others, which is still sufficient for one
> capacity, but because the first sibling is low, we'll assign it 0 and
> instead assign 1 to the second, again, leaving the third and fourth 0.

Hi Peter,

Thanks for the suggestion.

> If the group were a core group, the total would be much higher and we'd
> likely end up assigning 1 to each before we'd run out of capacity.

This is a tricky case because we are depending upon DIV_ROUND_CLOSEST
to decide whether to flag capacity as 0 or 1.  We will not have any
task movement until capacity is depleted to quite a low value by the RT
task.  Having a threshold to flag 0/1 instead of DIV_ROUND_CLOSEST,
just as you have suggested for the power savings case, may help here as
well to move tasks to other idle cores.

> For power savings, we can lower the threshold and maybe use the maximal
> individual cpu_power in the group to base 1 capacity from.
>
> So, suppose the second example, where sibling0 has 50 and the others
> have 294, you'd end up with a capacity distribution of: {0,1,1,1}.

One challenge here is that if RT tasks run on more than one thread in
this group, we will have slightly different cpu powers.  Arranging them
from max to min and having a cutoff threshold should work.

Should we keep the RT scaling as a separate entity alongside cpu_power
to simplify these thresholds?  Whenever we need to scale group load
with cpu power we can take the product of cpu_power and scale_rt_power,
but in these cases where we compute capacity, we can mark 0 or 1 just
based on whether scale_rt_power was less than SCHED_LOAD_SCALE or not.
Alternatively we can keep cpu_power as a product of all scaling factors
as it is today, but also save the component scale factors, like
scale_rt_power() and arch_scale_freq_power(), so that they can be used
in load balance decisions.

Basically, in power save balance we would give all threads a capacity
of '1' unless the cpu_power was reduced due to an RT task.  Similarly,
in the non-power-save case, we can have flags 1,0,0,0 unless the first
thread had RT scaling during the last interval.

I am suggesting we distinguish the reduction in cpu_power due to
architectural (hardware DVFS) reasons from that due to RT tasks, so
that it is easy to decide whether moving tasks to a sibling thread or
core can help or not.

--Vaidy
On Wed, 2010-06-02 at 04:22 +0530, Vaidyanathan Srinivasan wrote:
> > If the group were a core group, the total would be much higher and we'd
> > likely end up assigning 1 to each before we'd run out of capacity.
>
> This is a tricky case because we are depending upon DIV_ROUND_CLOSEST
> to decide whether to flag capacity as 0 or 1.  We will not have any
> task movement until capacity is depleted to quite a low value by the RT
> task.  Having a threshold to flag 0/1 instead of DIV_ROUND_CLOSEST,
> just as you have suggested for the power savings case, may help here as
> well to move tasks to other idle cores.

Right, well we could put the threshold higher than the 50%, say 90% or
so.

> > For power savings, we can lower the threshold and maybe use the maximal
> > individual cpu_power in the group to base 1 capacity from.
> >
> > So, suppose the second example, where sibling0 has 50 and the others
> > have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
>
> One challenge here is that if RT tasks run on more than one thread in
> this group, we will have slightly different cpu powers.  Arranging them
> from max to min and having a cutoff threshold should work.

Right, like the 90% above.

> Should we keep the RT scaling as a separate entity alongside cpu_power
> to simplify these thresholds?  Whenever we need to scale group load
> with cpu power we can take the product of cpu_power and scale_rt_power,
> but in these cases where we compute capacity, we can mark 0 or 1 just
> based on whether scale_rt_power was less than SCHED_LOAD_SCALE or not.
> Alternatively we can keep cpu_power as a product of all scaling factors
> as it is today, but also save the component scale factors, like
> scale_rt_power() and arch_scale_freq_power(), so that they can be used
> in load balance decisions.

Right, so the question is, do we only care about RT, or should capacity
reflect the full asymmetric MP case?

I don't quite see why RT is special compared to any of the other scale
factors; if someone pegged one core at half the frequency of the
others, you'd still want it to get 0 capacity so that we only try to
populate it on overload.

> Basically, in power save balance we would give all threads a capacity
> of '1' unless the cpu_power was reduced due to an RT task.  Similarly,
> in the non-power-save case, we can have flags 1,0,0,0 unless the first
> thread had RT scaling during the last interval.
>
> I am suggesting we distinguish the reduction in cpu_power due to
> architectural (hardware DVFS) reasons from that due to RT tasks, so
> that it is easy to decide whether moving tasks to a sibling thread or
> core can help or not.

For power savings such a special heuristic _might_ make sense.
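The threshold variant discussed in this exchange can be sketched in user space; a hedged stand-in for the DIV_ROUND_CLOSEST-based capacity check (the 90% figure is the suggestion from the mail, and the helper name is invented):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE	1024UL

/*
 * 0/1 capacity from a percentage threshold instead of DIV_ROUND_CLOSEST,
 * which only flips to 0 once power drops below 50% of SCHED_LOAD_SCALE.
 * A higher threshold flags a cpu 0 after a much smaller RT (or frequency)
 * reduction, so load moves to sibling cpus sooner.
 */
static unsigned long capacity_above(unsigned long power, unsigned long pct)
{
	return power * 100 >= SCHED_LOAD_SCALE * pct ? 1UL : 0UL;
}
```

At a 90% threshold, a cpu at full power (1024) still gets capacity 1, but one scaled down to, say, 717 by RT time gets 0, whereas DIV_ROUND_CLOSEST(717, 1024) would still round up to 1 and suppress any task movement.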
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
 		}
 
 		capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		capacity = max(capacity, 1UL);
 
 		if (tmp->flags & SD_POWERSAVINGS_BALANCE)
 			nr_running /= 2;
@@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
 
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = max(sgs->group_capacity, 1UL);
 }
 
 /**
@@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		unsigned long capacity;
 		unsigned long wl;
 
+		capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
+
 		if (!cpumask_test_cpu(i, cpus))
 			continue;
When calculating capacity we use the following calculation:

	capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);

In SMT2, power will be 1178/2 (provided we are not scaling power with
freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
being 1.

With SMT4 however, power will be 1178/4, hence capacity will end up as
0.

Fix this by ensuring capacity is always at least 1 after this
calculation.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
I'm not sure this is the correct fix but this works for me.

Original post here:
  http://lkml.org/lkml/2010/3/30/884

---
 kernel/sched_fair.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)