Message ID | 20091008095027.GC20595@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Benjamin Herrenschmidt |
Headers | show |
On Thu, 2009-10-08 at 15:20 +0530, Arun R Bharadwaj wrote: > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-10-08 15:18:28]: > > Implement a list based registering mechanism for architectures which > have multiple sets of idle routines which are to be registered. > > Currently, in x86 it is done by merely setting pm_idle = idle_routine > and managing this pm_idle pointer is messy. > > To give an example of how this mechanism works: > In x86, initially, idle routine is selected from the set of poll/mwait/ > c1e/default idle loops. So the selected idle loop is registered in cpuidle > as one idle state cpuidle devices. Once ACPI comes up, it registers > another set of idle states on top of this state. Again, suppose a module > registers another set of idle loops, it is added to this list. > > This provides a clean way of registering and unregistering idle state > routines. So cpuidle didn't already have a list of idle functions it takes an appropriate one from? Then what does this governor do? Also, does this imply the governor doesn't consider these idle routines?
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:36:02]: > On Thu, 2009-10-08 at 15:20 +0530, Arun R Bharadwaj wrote: > > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-10-08 15:18:28]: > > > > Implement a list based registering mechanism for architectures which > > have multiple sets of idle routines which are to be registered. > > > > Currently, in x86 it is done by merely setting pm_idle = idle_routine > > and managing this pm_idle pointer is messy. > > > > To give an example of how this mechanism works: > > In x86, initially, idle routine is selected from the set of poll/mwait/ > > c1e/default idle loops. So the selected idle loop is registered in cpuidle > > as one idle state cpuidle devices. Once ACPI comes up, it registers > > another set of idle states on top of this state. Again, suppose a module > > registers another set of idle loops, it is added to this list. > > > > This provides a clean way of registering and unregistering idle state > > routines. > > So cpuidle didn't already have a list of idle functions it takes an > appropriate one from? > No.. As of now, cpuidle supported only one _set_ of idle states that can be registered. So in this one set, it would choose the appropriate idle state. But this list mechanism(actually a stack) allows for multiple sets. This is needed because we have a hierarchy of idle states discovery in x86. First, select_idle_routine() would select poll/mwait/default/c1e. It doesn't know of existance of ACPI. Later when ACPI comes up, it registers a set of routines on top of the earlier set. > Then what does this governor do? > The governor would only select the best idle state available from the set of states which is at the top of the stack. (In the above case, it would only consider the states registered by ACPI). If the top-of-the-stack set of idle states is unregistered, the next set of states on the stack are considered. > Also, does this imply the governor doesn't consider these idle routines? > As i said above, governor would only consider the idle routines which are at the top of the stack. Hope this gave a better idea..
On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote: > > > So cpuidle didn't already have a list of idle functions it takes an > > appropriate one from? > > > > No.. As of now, cpuidle supported only one _set_ of idle states that > can be registered. So in this one set, it would choose the appropriate > idle state. But this list mechanism(actually a stack) allows for > multiple sets. > > This is needed because we have a hierarchy of idle states discovery > in x86. First, select_idle_routine() would select poll/mwait/default/c1e. > It doesn't know of existance of ACPI. Later when ACPI comes up, > it registers a set of routines on top of the earlier set. > > > Then what does this governor do? > > > > The governor would only select the best idle state available from the > set of states which is at the top of the stack. (In the above case, it > would only consider the states registered by ACPI). > > If the top-of-the-stack set of idle states is unregistered, the next > set of states on the stack are considered. > > > Also, does this imply the governor doesn't consider these idle routines? > > > > As i said above, governor would only consider the idle routines which > are at the top of the stack. > > Hope this gave a better idea.. So does it make sense to have a set of sets? Why not integrate them all into one set to be ruled by this governor thing?
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:50:33]: > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote: > > > > > So cpuidle didn't already have a list of idle functions it takes an > > > appropriate one from? > > > > > > > No.. As of now, cpuidle supported only one _set_ of idle states that > > can be registered. So in this one set, it would choose the appropriate > > idle state. But this list mechanism(actually a stack) allows for > > multiple sets. > > > > This is needed because we have a hierarchy of idle states discovery > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e. > > It doesn't know of existance of ACPI. Later when ACPI comes up, > > it registers a set of routines on top of the earlier set. > > > > > Then what does this governor do? > > > > > > > The governor would only select the best idle state available from the > > set of states which is at the top of the stack. (In the above case, it > > would only consider the states registered by ACPI). > > > > If the top-of-the-stack set of idle states is unregistered, the next > > set of states on the stack are considered. > > > > > Also, does this imply the governor doesn't consider these idle routines? > > > > > > > As i said above, governor would only consider the idle routines which > > are at the top of the stack. > > > > Hope this gave a better idea.. > > So does it make sense to have a set of sets? > > Why not integrate them all into one set to be ruled by this governor > thing? > Right now there is a clean hierarchy. So breaking that would mean putting the registration of all idle routines under ACPI. So, if ACPI fails to come up or if ACPI is not supported, that would lead to problems. Because if that happens now, we can fallback to the initially registered set. Also, if a module wants to register a set of routines later on, that cant be added to the initially registered set. So i think we need this set of sets.
On Thu, 2009-10-08 at 16:31 +0530, Arun R Bharadwaj wrote: > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:50:33]: > > > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote: > > > > > > > So cpuidle didn't already have a list of idle functions it takes an > > > > appropriate one from? > > > > > > > > > > No.. As of now, cpuidle supported only one _set_ of idle states that > > > can be registered. So in this one set, it would choose the appropriate > > > idle state. But this list mechanism(actually a stack) allows for > > > multiple sets. > > > > > > This is needed because we have a hierarchy of idle states discovery > > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e. > > > It doesn't know of existance of ACPI. Later when ACPI comes up, > > > it registers a set of routines on top of the earlier set. > > > > > > > Then what does this governor do? > > > > > > > > > > The governor would only select the best idle state available from the > > > set of states which is at the top of the stack. (In the above case, it > > > would only consider the states registered by ACPI). > > > > > > If the top-of-the-stack set of idle states is unregistered, the next > > > set of states on the stack are considered. > > > > > > > Also, does this imply the governor doesn't consider these idle routines? > > > > > > > > > > As i said above, governor would only consider the idle routines which > > > are at the top of the stack. > > > > > > Hope this gave a better idea.. > > > > So does it make sense to have a set of sets? > > > > Why not integrate them all into one set to be ruled by this governor > > thing? > > > > Right now there is a clean hierarchy. So breaking that would mean > putting the registration of all idle routines under ACPI. Uhm, no, it would mean ACPI putting its idle routines on the same level as all others. > So, if ACPI > fails to come up or if ACPI is not supported, that would lead to > problems. I think the problem is that ACPI is thinking its special, that should be rectified, its not. > Because if that happens now, we can fallback to the > initially registered set. I'm thinking its all daft and we should be having one set of idle routines, if ACPI fails (a tautology if ever there was one) we simply wouldn't have its idle routines to pick from. > Also, if a module wants to register a set of routines later on, that > cant be added to the initially registered set. So i think we need this > set of sets. Sounds like something is wrong alright. If you can register an idle routine you should be able to unregister it too. What about making ACPI register its idle routines too, 1 for each C state, and have the governor make a selection out of the full set? That also allows you to do away with this default_idle() nonsense and simply panic the box when there are no registered idle routines when the box wants to go idle.
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 13:25:10]: > On Thu, 2009-10-08 at 16:31 +0530, Arun R Bharadwaj wrote: > > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 12:50:33]: > > > > > On Thu, 2009-10-08 at 16:12 +0530, Arun R Bharadwaj wrote: > > > > > > > > > So cpuidle didn't already have a list of idle functions it takes an > > > > > appropriate one from? > > > > > > > > > > > > > No.. As of now, cpuidle supported only one _set_ of idle states that > > > > can be registered. So in this one set, it would choose the appropriate > > > > idle state. But this list mechanism(actually a stack) allows for > > > > multiple sets. > > > > > > > > This is needed because we have a hierarchy of idle states discovery > > > > in x86. First, select_idle_routine() would select poll/mwait/default/c1e. > > > > It doesn't know of existance of ACPI. Later when ACPI comes up, > > > > it registers a set of routines on top of the earlier set. > > > > > > > > > Then what does this governor do? > > > > > > > > > > > > > The governor would only select the best idle state available from the > > > > set of states which is at the top of the stack. (In the above case, it > > > > would only consider the states registered by ACPI). > > > > > > > > If the top-of-the-stack set of idle states is unregistered, the next > > > > set of states on the stack are considered. > > > > > > > > > Also, does this imply the governor doesn't consider these idle routines? > > > > > > > > > > > > > As i said above, governor would only consider the idle routines which > > > > are at the top of the stack. > > > > > > > > Hope this gave a better idea.. > > > > > > So does it make sense to have a set of sets? > > > > > > Why not integrate them all into one set to be ruled by this governor > > > thing? > > > > > > > Right now there is a clean hierarchy. So breaking that would mean > > putting the registration of all idle routines under ACPI. > > Uhm, no, it would mean ACPI putting its idle routines on the same level > as all others. > Putting them all on the same level would mean, we need an enable/disable routine to enable only the currently active routines. Also, the way governor works is that, it assumes that idle routines are indexed in the increasing order of power benefit that can be got out of the state. So this would get messed up. > > So, if ACPI > > fails to come up or if ACPI is not supported, that would lead to > > problems. > > I think the problem is that ACPI is thinking its special, that should be > rectified, its not. > > > Because if that happens now, we can fallback to the > > initially registered set. > > I'm thinking its all daft and we should be having one set of idle > routines, if ACPI fails (a tautology if ever there was one) we simply > wouldn't have its idle routines to pick from. > > > Also, if a module wants to register a set of routines later on, that > > cant be added to the initially registered set. So i think we need this > > set of sets. > > Sounds like something is wrong alright. If you can register an idle > routine you should be able to unregister it too. > Yes, we can register and unregister in a clean way now. Consider this. We have a set of routines A, B, C currently registered. Now a module comes and registers D and E, and later on at some point of time wants to unregister. So how do you keep track of what all idle routines the module registered and unregister only those? Best way to do that is a stack, which is how I have currently implemented. > What about making ACPI register its idle routines too, 1 for each C > state, and have the governor make a selection out of the full set? > > That also allows you to do away with this default_idle() nonsense and > simply panic the box when there are no registered idle routines when the > box wants to go idle. >
On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote: > > > Uhm, no, it would mean ACPI putting its idle routines on the same level > > as all others. > > > > Putting them all on the same level would mean, we need an > enable/disable routine to enable only the currently active routines. What's this enable/disable stuff about? > Also, the way governor works is that, it assumes that idle routines > are indexed in the increasing order of power benefit that can be got > out of the state. So this would get messed up. Right, which is why I initially had a power-savings field in my proposal, so it could weight the power savings vs the wakeup latency. http://lkml.org/lkml/2009/8/27/159 There it was said that was exactly what these governors were doing, seems its not. > > Sounds like something is wrong alright. If you can register an idle > > routine you should be able to unregister it too. > > > > Yes, we can register and unregister in a clean way now. > Consider this. We have a set of routines A, B, C currently registered. > Now a module comes and registers D and E, and later on at some point > of time wants to unregister. So how do you keep track of what all idle > routines the module registered and unregister only those? > Best way to do that is a stack, which is how I have currently > implemented. Right, so destroy that inner set thing, that way we only have one left ;-)
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 14:25:37]: > On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote: > > > > > Uhm, no, it would mean ACPI putting its idle routines on the same level > > > as all others. > > > > > > > Putting them all on the same level would mean, we need an > > enable/disable routine to enable only the currently active routines. > > What's this enable/disable stuff about? Don't we need an explicit 'don't use this routine' apart from having the weights based on power consumption and latency. In multiple registration the assumption is that the top most 'set' has all necessary routines and we do not need any other idle routines for optimum operation. Otherwise the governor has to select from larger 'set' which could have redundant or conflicting idle routines. For example we now have c1e_idle to start with and then a set of ACPI C1, C2, C3 routines. The expectation now is that once we have the ACPI's routines, we do not need the previous used c1e_idle because this new set is self contained and picking one from this set based on latency is good for power savings. > > Also, the way governor works is that, it assumes that idle routines > > are indexed in the increasing order of power benefit that can be got > > out of the state. So this would get messed up. > > Right, which is why I initially had a power-savings field in my > proposal, so it could weight the power savings vs the wakeup latency. > > http://lkml.org/lkml/2009/8/27/159 This proposal that you had suggested is being used for the 'set' of idle routines. The patch changes the idle routines as 'sets' and does not mix use of routines between two registrations. > There it was said that was exactly what these governors were doing, > seems its not. Governors select from a set of idle routines based on latency. But there is a probability that any of the routine in the set can be selected. > > > Sounds like something is wrong alright. If you can register an idle > > > routine you should be able to unregister it too. > > > > > > > Yes, we can register and unregister in a clean way now. > > Consider this. We have a set of routines A, B, C currently registered. > > Now a module comes and registers D and E, and later on at some point > > of time wants to unregister. So how do you keep track of what all idle > > routines the module registered and unregister only those? > > Best way to do that is a stack, which is how I have currently > > implemented. > > Right, so destroy that inner set thing, that way we only have one > left ;-) If un-registration is not needed, then this framework can easily override the current set with the new one and not worry about the set of sets. Ideally, during system boot, we could wait until we know enough information about idle states and then have a single registration. The boot process can be in poll-idle until this decision happens. Like in x86, we can register either c1e_idle or ACPI's routines at a stage where we know for sure if ACPI is enabled or not. --Vaidy
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-10-08 14:25:37]: > On Thu, 2009-10-08 at 17:31 +0530, Arun R Bharadwaj wrote: > > > > > Uhm, no, it would mean ACPI putting its idle routines on the same level > > > as all others. > > > > > > > Putting them all on the same level would mean, we need an > > enable/disable routine to enable only the currently active routines. > > What's this enable/disable stuff about? > > > Also, the way governor works is that, it assumes that idle routines > > are indexed in the increasing order of power benefit that can be got > > out of the state. So this would get messed up. > > Right, which is why I initially had a power-savings field in my > proposal, so it could weight the power savings vs the wakeup latency. > > http://lkml.org/lkml/2009/8/27/159 > > There it was said that was exactly what these governors were doing, > seems its not. > > > > Sounds like something is wrong alright. If you can register an idle > > > routine you should be able to unregister it too. > > > > > > > Yes, we can register and unregister in a clean way now. > > Consider this. We have a set of routines A, B, C currently registered. > > Now a module comes and registers D and E, and later on at some point > > of time wants to unregister. So how do you keep track of what all idle > > routines the module registered and unregister only those? > > Best way to do that is a stack, which is how I have currently > > implemented. > > Right, so destroy that inner set thing, that way we only have one > left ;-) > I'm not convinced with your argument. Why dont we do this incrementally. Right now, this set of sets mechanism works fine and doesn't look like it has any obvious flaws in it. We have a clean register/unregister mechanism which solves all the earlier problems we started out to solve. We can gradually build on this and try to come up with a single set of idle states for a governor to choose from. thanks, arun
Peter Zijlstra <a.p.zijlstra@chello.nl> writes: > > So does it make sense to have a set of sets? > > Why not integrate them all into one set to be ruled by this governor > thing? cpuidle is currently optional, that is why the two level hierarchy is there so that you can still have simple idle selection without it. % size drivers/cpuidle/*.o text data bss dec hex filename 5514 1416 44 6974 1b3e drivers/cpuidle/built-in.o Adding it unconditionally would add ~7k to everyone who wants idle functions. I think making it unconditional would require putting it on a serious diet first. -Andi
* Andi Kleen <andi@firstfloor.org> [2009-10-12 20:00:05]: > Peter Zijlstra <a.p.zijlstra@chello.nl> writes: > > > > So does it make sense to have a set of sets? > > > > Why not integrate them all into one set to be ruled by this governor > > thing? > > cpuidle is currently optional, that is why the two level hierarchy > is there so that you can still have simple idle selection without it. > > % size drivers/cpuidle/*.o > text data bss dec hex filename > 5514 1416 44 6974 1b3e drivers/cpuidle/built-in.o > > Adding it unconditionally would add ~7k to everyone who wants idle functions. > > I think making it unconditional would require putting it on a serious > diet first. > Hi Andi, Yes, this is a valid point. How about something like this.. If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call which is called from cpu_idle() should call default_idle without involving the registering cpuidle steps. This should prevent bloating up of the kernel for archs which dont want to use cpuidle. --arun > -Andi > -- > ak@linux.intel.com -- Speaking for myself only.
> How about something like this.. > If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call > which is called from cpu_idle() should call default_idle without > involving the registering cpuidle steps. This should prevent bloating > up of the kernel for archs which dont want to use cpuidle. On x86 some people want small kernel too, so selecting it on a architecture granuality is not good. Also you can make it default, you just need to slim it down first. -Andi
* Andi Kleen <andi@firstfloor.org> [2009-10-14 09:18:38]: > > How about something like this.. > > If the arch does not enable CONFIG_CPU_IDLE, the cpuidle_idle_call > > which is called from cpu_idle() should call default_idle without > > involving the registering cpuidle steps. This should prevent bloating > > up of the kernel for archs which dont want to use cpuidle. > > On x86 some people want small kernel too, so selecting it on a architecture > granuality is not good. Also you can make it default, you just need > to slim it down first. > No, I dont mean selecting it on an architecture granularity. At compile time, if CONFIG_CPU_IDLE is disabled, the arch can redefine cpuidle_idle_call. For e.g. in arch/x86/kernel/process.c #ifndef CONFIG_CPU_IDLE void cpuidle_idle_call(void) { if (local_idle) local_idle(); else default_idle(); } #endif where local_idle points to the idle routine selected using select_idle_routine() which can be poll, mwait, c1e. So this way, we still preserve the exact same functionality as before and we also remove the ugly pm_idle exported function pointer and we avoid unnecessary code bloat for platforms who do not want to use cpuidle. --arun
Index: linux.trees.git/drivers/cpuidle/cpuidle.c =================================================================== --- linux.trees.git.orig/drivers/cpuidle/cpuidle.c +++ linux.trees.git/drivers/cpuidle/cpuidle.c @@ -22,6 +22,7 @@ #include "cpuidle.h" DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices); +DEFINE_PER_CPU(struct list_head, cpuidle_devices_list); DEFINE_MUTEX(cpuidle_lock); @@ -112,6 +113,45 @@ void cpuidle_resume_and_unlock(void) EXPORT_SYMBOL_GPL(cpuidle_resume_and_unlock); +int cpuidle_add_to_list(struct cpuidle_device *dev) +{ + int ret, cpu = dev->cpu; + struct cpuidle_device *old_dev; + + if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) { + old_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu), + struct cpuidle_device, idle_list); + cpuidle_remove_state_sysfs(old_dev); + } + + list_add(&dev->idle_list, &per_cpu(cpuidle_devices_list, cpu)); + ret = cpuidle_add_state_sysfs(dev); + return ret; +} + +void cpuidle_remove_from_list(struct cpuidle_device *dev) +{ + struct cpuidle_device *temp_dev; + struct list_head *pos; + int ret, cpu = dev->cpu; + + list_for_each(pos, &per_cpu(cpuidle_devices_list, cpu)) { + temp_dev = container_of(pos, struct cpuidle_device, idle_list); + if (dev == temp_dev) { + list_del(&temp_dev->idle_list); + cpuidle_remove_state_sysfs(temp_dev); + break; + } + } + + if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) { + temp_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu), + struct cpuidle_device, idle_list); + ret = cpuidle_add_state_sysfs(temp_dev); + } + cpuidle_kick_cpus(); +} + /** * cpuidle_enable_device - enables idle PM for a CPU * @dev: the CPU @@ -136,9 +176,6 @@ int cpuidle_enable_device(struct cpuidle return ret; } - if ((ret = cpuidle_add_state_sysfs(dev))) - return ret; - if (cpuidle_curr_governor->enable && (ret = cpuidle_curr_governor->enable(dev))) goto fail_sysfs; @@ -157,7 +194,7 @@ int cpuidle_enable_device(struct cpuidle return 0; fail_sysfs: - cpuidle_remove_state_sysfs(dev); + cpuidle_remove_from_list(dev); return ret; } @@ -182,8 +219,6 @@ void cpuidle_disable_device(struct cpuid if (cpuidle_curr_governor->disable) cpuidle_curr_governor->disable(dev); - - cpuidle_remove_state_sysfs(dev); } EXPORT_SYMBOL_GPL(cpuidle_disable_device); @@ -267,6 +302,7 @@ int cpuidle_register_device(struct cpuid } cpuidle_enable_device(dev); + cpuidle_add_to_list(dev); mutex_unlock(&cpuidle_lock); @@ -288,6 +324,7 @@ void cpuidle_unregister_device(struct cp cpuidle_pause_and_lock(); cpuidle_disable_device(dev); + cpuidle_remove_from_list(dev); per_cpu(cpuidle_devices, dev->cpu) = NULL; @@ -338,12 +375,15 @@ static inline void latency_notifier_init */ static int __init cpuidle_init(void) { - int ret; + int ret, cpu; ret = cpuidle_add_class_sysfs(&cpu_sysdev_class); if (ret) return ret; + for_each_possible_cpu(cpu) + INIT_LIST_HEAD(&per_cpu(cpuidle_devices_list, cpu)); + latency_notifier_init(&cpuidle_latency_notifier); return 0; Index: linux.trees.git/include/linux/cpuidle.h =================================================================== --- linux.trees.git.orig/include/linux/cpuidle.h +++ linux.trees.git/include/linux/cpuidle.h @@ -92,6 +92,7 @@ struct cpuidle_device { struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX]; struct cpuidle_state *last_state; + struct list_head idle_list; struct kobject kobj; struct completion kobj_unregister; void *governor_data;
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-10-08 15:18:28]: Implement a list based registering mechanism for architectures which have multiple sets of idle routines which are to be registered. Currently, in x86 it is done by merely setting pm_idle = idle_routine and managing this pm_idle pointer is messy. To give an example of how this mechanism works: In x86, initially, idle routine is selected from the set of poll/mwait/ c1e/default idle loops. So the selected idle loop is registered in cpuidle as one idle state cpuidle devices. Once ACPI comes up, it registers another set of idle states on top of this state. Again, suppose a module registers another set of idle loops, it is added to this list. This provides a clean way of registering and unregistering idle state routines. In the current implementation, pm_idle is set as the current idle routine being used and the old idle routine has to be maintained and when a module registers/unregisters an idle routine, confusion arises. Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> --- drivers/cpuidle/cpuidle.c | 54 ++++++++++++++++++++++++++++++++++++++++------ include/linux/cpuidle.h | 1 2 files changed, 48 insertions(+), 7 deletions(-)