Message ID | 20090819112554.GY12579@kernel.dk
---|---
State | Not Applicable
Delegated to | David Miller
On 08/19/2009 07:25 AM, Jens Axboe wrote:
> On boxes with lots of CPUs, we have so many kernel threads it's not
> funny. The basic problem is that create_workqueue() creates a per-cpu
> thread, where we could easily get by with a single thread for a lot of
> cases.
>
> One such case appears to be ata_wq. You want at most one per pio drive,
> not one per CPU. I'd suggest just dropping it to a single threaded wq.
>
> [...]
> -	ata_wq = create_workqueue("ata");
> +	ata_wq = create_singlethread_workqueue("ata");

I agree that one-thread-per-cpu is too much in these modern multi-core
times, but one thread is too little. You have essentially re-created
simplex DMA -- blocking and waiting such that one drive out of ~4 can be
used at any one time.

ata_pio_task() is in a workqueue so that it can sleep and/or spend a long
time polling ATA registers. That means an active task can definitely
starve all other tasks in the workqueue, if only one thread is available.
If starvation occurs, it will potentially pause the unrelated task for
several seconds.

The proposed patch actually expands an existing problem -- the
uniprocessor case, where there is only one workqueue thread. For the
reasons outlined above, we actually want multiple threads even in the UP
case. If you have more than one PIO device, latency is bloody awful, with
occasional multi-second "hiccups" as one PIO device waits for another.
It's an ugly wart that users DO occasionally complain about; luckily most
users have at most one PIO-polled device.

It would be nice if we could replace this workqueue with a thread pool,
where thread count inside the pool ranges from zero to $N depending on
level of thread pool activity. Our common case in libata would be
_zero_ threads, if so...

	Jeff
On Wed, Aug 19 2009, Jeff Garzik wrote:
> [...]
> It would be nice if we could replace this workqueue with a thread pool,
> where thread count inside the pool ranges from zero to $N depending on
> level of thread pool activity. Our common case in libata would be
> _zero_ threads, if so...

That would be ideal; N would essentially be:

	N = min(nr_cpus, nr_drives_that_need_pio);

How can I easily test whether we will ever need a pio thread for a drive
in libata? For a simple patch, I would suggest simply creating a single
threaded workqueue per ap instead, if that ata_port would ever want PIO.
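[As an illustration of the per-port fallback Jens suggests, a minimal sketch
might look like the following. The ap->pio_wq field and the probe-time hook
are assumptions, not existing libata code, and using ATA_FLAG_PIO_POLLING as
the "will this port ever poll?" test is a simplification.]

	#include <linux/kernel.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>
	#include <linux/libata.h>

	static int ata_port_start_pio_wq(struct ata_port *ap)
	{
		char *name;

		if (!(ap->flags & ATA_FLAG_PIO_POLLING))
			return 0;	/* port never polls; no thread needed */

		/* the workqueue code keeps the name pointer, so it must persist */
		name = kasprintf(GFP_KERNEL, "ata_pio%u", ap->print_id);
		if (!name)
			return -ENOMEM;

		ap->pio_wq = create_singlethread_workqueue(name);	/* assumed new field */
		if (!ap->pio_wq) {
			kfree(name);
			return -ENOMEM;
		}
		return 0;
	}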
Jens Axboe wrote:
> On Wed, Aug 19 2009, Jeff Garzik wrote:
>> [...]
>> It would be nice if we could replace this workqueue with a thread pool,
>> where thread count inside the pool ranges from zero to $N depending on
>> level of thread pool activity. Our common case in libata would be
>> _zero_ threads, if so...
>
> That would be ideal; N would essentially be:
>
> 	N = min(nr_cpus, nr_drives_that_need_pio);
..

No, that would leave just a single thread again for UP.

It would be nice to just create these threads on-demand, and destroy them
again after periods of disuse. Kind of like how Apache does worker
threads.
On Wed, Aug 19 2009, Mark Lord wrote:
> Jens Axboe wrote:
>> [...]
>> That would be ideal; N would essentially be:
>>
>> 	N = min(nr_cpus, nr_drives_that_need_pio);
> ..
>
> No, that would leave just a single thread again for UP.

So one thread per ap would be better, then.

> It would be nice to just create these threads on-demand, and destroy them
> again after periods of disuse. Kind of like how Apache does worker
> threads.

Well, that's the same thread pool suggestion that Jeff came up with. And I
agree, that's a nicer long term solution (it's also how the per-bdi
flushing replacement works). The problem with that appears to be that any
suggested patchset for thread pools spirals into certain "but what color
should it be?!" death.

I'll try and work up a simple create_threadpool() implementation; we can
literally cut away hundreds of threads with that.
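[What such a create_threadpool() interface might look like, purely as an
illustration -- no API by this name exists in mainline, and every identifier
below is an assumption.]

	/* A pool that starts with zero threads, grows on demand up to
	 * @max_threads, and reaps workers idle longer than @idle_timeout
	 * jiffies. */
	struct threadpool;

	typedef void (threadpool_fn_t)(void *data);

	struct threadpool *create_threadpool(const char *name,
					     unsigned int max_threads,
					     unsigned long idle_timeout);

	/* Run fn(data) in some pool thread, spawning a new one if all are busy. */
	int threadpool_queue_work(struct threadpool *pool,
				  threadpool_fn_t *fn, void *data);

	void destroy_threadpool(struct threadpool *pool);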
On 08/19/2009 08:23 AM, Jens Axboe wrote:
> On Wed, Aug 19 2009, Mark Lord wrote:
>> [...]
>> No, that would leave just a single thread again for UP.
>
> So one thread per ap would be better, then.

Yes.

>> It would be nice to just create these threads on-demand, and destroy them
>> again after periods of disuse. Kind of like how Apache does worker
>> threads.
>
> Well, that's the same thread pool suggestion that Jeff came up with. And I
> agree, that's a nicer long term solution (it's also how the per-bdi
> flushing replacement works). The problem with that appears to be that any
> suggested patchset for thread pools spirals into certain "but what color
> should it be?!" death.

Let people complain with code :)  libata has two basic needs in this area:
(1) specifying a thread count other than "1" or "nr-cpus"
(2) don't start unneeded threads / idle out unused threads

> I'll try and work up a simple create_threadpool() implementation; we can
> literally cut away hundreds of threads with that.

That would be fantastic.
It really _would_ remove hundreds of threads on this 2x4 (two quad-core
packages) system I have, for example. I bet systems like those sparc64
Niagaras or 32-core MIPS boxes are completely insane with unused kernel
threads...

But if create_threadpool() proves too ambitious, having the ability for
the subsystem to specify the number of threads (#1, above) should suffice
to fix your problem and the existing UP-thread-count problem.

	Jeff
>>> It would be nice to just create these threads on-demand, and destroy them
>>> again after periods of disuse. Kind of like how Apache does worker
>>> threads.
>>
>> Well, that's the same thread pool suggestion that Jeff came up with. And I
>> agree, that's a nicer long term solution (it's also how the per-bdi
>> flushing replacement works). The problem with that appears to be that any
>> suggested patchset for thread pools spirals into certain "but what color
>> should it be?!" death.
>
> Let people complain with code :)  libata has two basic needs in this area:
> (1) specifying a thread count other than "1" or "nr-cpus"
> (2) don't start unneeded threads / idle out unused threads

To be even more general, libata needs a workqueue or thread pool that can

(a) scale up to nr-drives-that-use-pio threads, on demand
(b) scale down to zero threads, with lack of demand

That handles the worst case of each PIO-polling drive needing to sleep
(thus massively impacting latency, if any other PIO-polling drive must
wait for a free thread).

That also handles the best case of not needing any threads at all.

	Jeff
Hello, guys.

Jeff Garzik wrote:
>> Let people complain with code :)  libata has two basic needs in this area:
>> (1) specifying a thread count other than "1" or "nr-cpus"
>> (2) don't start unneeded threads / idle out unused threads
>
> To be even more general, libata needs a workqueue or thread pool that can
>
> (a) scale up to nr-drives-that-use-pio threads, on demand
> (b) scale down to zero threads, with lack of demand
>
> That handles the worst case of each PIO-polling drive needing to sleep
> (thus massively impacting latency, if any other PIO-polling drive must
> wait for a free thread).
>
> That also handles the best case of not needing any threads at all.

Heh... I've been trying to implement in-kernel media presence polling and
hit about the same problem. The problem is quite widespread.

The choice of a multithreaded workqueue was intentional, as Jeff
explained. There are many workqueues which are created in fear of blocking
or being blocked by other works, although in most cases that shouldn't be
a problem. Then there's the newly added async mechanism, which I don't
quite get, as it runs the worker function from a different environment
depending on resource availability - the worker function might be executed
synchronously, where it might have a different context w.r.t. locking or
whatever.

So, I've spent some time thinking about an alternative so that things can
be unified.

* Per-cpu binding is good.

* Managing the right level of concurrency isn't easy. If we try to
  schedule works too soon we can end up wasting resources and slowing
  things down compared to the current somewhat confined work processing.
  If works are scheduled too late, resources will be underutilized.

* Some workqueues are there to guarantee forward progress and avoid
  deadlocks around the work execution resource (workqueue threads). A
  similar mechanism needs to be in place.

* It would be nice to implement async execution in terms of workqueue, or
  even replace it with workqueue.

My somewhat crazy idea was like the following.

* All works get queued on a single unified per-cpu work list.

* The perfect level of concurrency can be managed by hooking into the
  scheduler and kicking a new worker thread iff the currently running
  worker is about to be scheduled out for whatever reason and there's no
  other worker ready to run.

* A thread pool of a few idle threads is always maintained per cpu and
  they get used by the above scheduler hooking. When the thread pool gets
  exhausted, the manager thread is scheduled instead and replenishes the
  pool. When there are too many idle threads, the pool size is reduced
  slowly.

* Forward progress can be guaranteed by reserving a single thread for any
  such group of works. When there are such works pending and the manager
  is invoked to replenish the worker pool, all such works on the queue are
  dispatched to their respective reserved threads. Please note that this
  will happen only rarely, as the worker pool size will be kept large
  enough and stable most of the time.

* Async can be reimplemented as works which get assigned to cpus in a
  round-robin manner. This wouldn't be perfect but should be enough.

Managing the perfect level of concurrency would have benefits in resource
usage, cache footprint, bandwidth and responsiveness. I haven't actually
tried to implement the above yet and am still wondering whether the
complexity is justified.

Thanks.
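[A very rough sketch of the scheduler-hook idea above; nothing like this
exists in the kernel yet, and every name below is invented for illustration.]

	#include <linux/spinlock.h>
	#include <linux/list.h>

	struct cpu_worker_pool {
		spinlock_t		lock;
		struct list_head	worklist;	/* the unified per-cpu work list */
		int			nr_running;	/* workers currently executing works */
		int			nr_idle;	/* parked reserve workers */
	};

	/* Not shown: wakes a parked worker, or wakes the manager to fork one. */
	static void pool_need_more_workers(struct cpu_worker_pool *pool);

	/*
	 * Hypothetical scheduler hook: called (under pool->lock) when a worker
	 * is about to be scheduled out.  If it was the last runnable worker on
	 * this CPU and work is still pending, bring in another worker so the
	 * per-cpu list keeps draining with one runnable worker per CPU.
	 */
	static void pool_worker_sleeping(struct cpu_worker_pool *pool)
	{
		if (--pool->nr_running == 0 && !list_empty(&pool->worklist))
			pool_need_more_workers(pool);
	}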
Somewhat simpler for the general case would be to implement

	create_workqueue(min_threads, max_threads, max_thread_idle_time);
	queue_work_within();

at which point you can queue work to the workqueue, but with a
per-workqueue timer running so you know when you might need to create a
new thread if the current work hasn't finished. Idle threads would then
sleep and either expire or pick up new work - so that under load we don't
keep destroying threads.

That might need a single thread (for the system) that does nothing but
create workqueue threads to order. It could manage the "next workqueue
deadline" timer and thread creation. The threads would then pick up their
workqueue work.

There is an intrinsic race where, if we are just on the limit, we might
create a thread just as there is no work to be done - but it would be rare
and would just then go away.

I see no point trying to fix ata when applying sanity to the workqueue
logic would sort a lot of other cases out nicely.
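[Filled out with plausible types, Alan's sketch might read as below. These
prototypes are hypothetical -- today's create_workqueue() takes only a name --
so treat them purely as an illustration of the idea.]

	#include <linux/workqueue.h>

	/* Hypothetical: bounded, self-sizing workqueue. */
	struct workqueue_struct *create_workqueue(const char *name,
						  unsigned int min_threads,
						  unsigned int max_threads,
						  unsigned long max_thread_idle_time);

	/*
	 * Queue @work with a promise that it starts running within @timeout
	 * jiffies; if the deadline nears and all threads are busy, the system
	 * manager thread creates another one (up to max_threads).
	 */
	int queue_work_within(struct workqueue_struct *wq,
			      struct work_struct *work,
			      unsigned long timeout);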
Hello, Alan.

Alan Cox wrote:
> Somewhat simpler for the general case would be to implement

Yeah, it would be great if there's a much simpler solution. A
latency-based one could be a good compromise.

> 	create_workqueue(min_threads, max_threads, max_thread_idle_time);
> 	queue_work_within();
>
> at which point you can queue work to the workqueue, but with a
> per-workqueue timer running so you know when you might need to create a
> new thread if the current work hasn't finished. Idle threads would then
> sleep and either expire or pick up new work - so that under load we don't
> keep destroying threads.

Are worker threads per workqueue? Combined with per-cpu binding, a dynamic
thread pool per workqueue can get quite messy. All three factors end up
getting multiplied - ie. #cpus * pool_size (which can be enlarged by the
same work hopping around) * #workqueues.

> That might need a single thread (for the system) that does nothing but
> create workqueue threads to order. It could manage the "next workqueue
> deadline" timer and thread creation.

Another problem is that if we apply this to the existing default workqueue,
which is used by many different supposed-to-be-short works in essentially
batch mode, we might end up enlarging the cache footprint by scheduling
unnecessarily many threads, which, in tight situations, might show up as a
small but noticeable performance regression.

> The threads would then pick up their workqueue work. There is an
> intrinsic race where, if we are just on the limit, we might create a
> thread just as there is no work to be done - but it would be rare
> and would just then go away.

Agreed. As long as the pool size is reduced gradually, maybe with some
buffer, I don't think this would be an issue.

> I see no point trying to fix ata when applying sanity to the workqueue
> logic would sort a lot of other cases out nicely.

About the same problem exists for in-kernel media presence polling.
Currently, I'm thinking about creating a worker thread per any device
which requires polling. It isn't too bad, but it would still be far better
if I could just schedule a work and not have to worry about managing
concurrency.

Thanks.
> Are worker threads per workqueue? Combined with per-cpu binding, a dynamic
> thread pool per workqueue can get quite messy. All three factors end up
> getting multiplied - ie. #cpus * pool_size (which can be enlarged by the
> same work hopping around) * #workqueues.

Stop a moment. Exactly how many workqueue users need per-cpu binding
wizardry?

> Another problem is that if we apply this to the existing default workqueue,
> which is used by many different supposed-to-be-short works in essentially
> batch mode, we might end up enlarging the cache footprint by scheduling
> unnecessarily many threads, which, in tight situations, might show up as a
> small but noticeable performance regression.

Only if you make the default assumed max wait time for the work too low -
it's a tunable behaviour, in fact.

	Alan
Alan Cox wrote:
>> Are worker threads per workqueue? Combined with per-cpu binding, a dynamic
>> thread pool per workqueue can get quite messy. All three factors end up
>> getting multiplied - ie. #cpus * pool_size (which can be enlarged by the
>> same work hopping around) * #workqueues.
>
> Stop a moment. Exactly how many workqueue users need per-cpu binding
> wizardry?

It's not about needing per-cpu binding, but if works can be executed on
the same cpu they were issued on, it's almost always beneficial. The only
reason why we have single threaded workqueues now is to limit the number
of threads.

>> Another problem is that if we apply this to the existing default
>> workqueue [...]
>
> Only if you make the default assumed max wait time for the work too low -
> it's a tunable behaviour, in fact.

If the default workqueue is made to manage concurrency well, most works
should be able to just use it, so the queue will contain both long running
ones and short running ones, which can disturb the current batch-like
processing of the default workqueue, which is assumed to have only short
ones. If we maintain separate workqueues for different purposes, a tuning
knob could be enough. If we try to manage the whole thing in a uniform
manner, a good tune might not exist. I'm not sure either way yet.

One of the reasons the current situation is messy is that there are too
many segregations among different thread pools (different workqueues, the
async thread pool, other private kthreads). It would be great if a single
work API were exported and concurrency were managed automatically so that
no one else has to worry about concurrency, but achieving that requires
much more intelligence in the workqueue implementation, as the basic
concurrency policies which used to be imposed by those segregations need
to be handled automatically. Maybe it's a better trade-off to leave those
segregations as-are and just add another workqueue type with a dynamic
thread pool.

Thanks.
> It's not about needing per-cpu binding, but if works can be executed on
> the same cpu they were issued on, it's almost always beneficial. The only
> reason why we have single threaded workqueues now is to limit the number
> of threads.

That would argue very strongly for putting all the logic in one place so
everything shares queues.

>> Only if you make the default assumed max wait time for the work too low -
>> it's a tunable behaviour, in fact.
>
> If the default workqueue is made to manage concurrency well, most works
> should be able to just use it, so the queue will contain both long running
> ones and short running ones, which can disturb the current batch-like
> processing of the default workqueue, which is assumed to have only short
> ones.

Not sure why it matters - the short ones will instead end up being
processed serially, in parallel to the hog.

> kthreads). It would be great if a single work API were exported and
> concurrency were managed automatically so that no one else has to worry
> about concurrency, but achieving that requires much more intelligence
> in the workqueue implementation, as the basic concurrency policies
> which used to be imposed by those segregations need to be handled
> automatically. Maybe it's a better trade-off to leave those
> segregations as-are and just add another workqueue type with a dynamic
> thread pool.

The more intelligence in the workqueue logic, the less in the drivers, and
the more it can be adjusted and adapt itself. Consider things like power
management, which might argue for breaking the cpu affinity to avoid
waking up a sleeping CPU in preference to jumping work between processors.
On Wed, 2009-08-19 at 14:23 +0200, Jens Axboe wrote:
> Well, that's the same thread pool suggestion that Jeff came up with. And I
> agree, that's a nicer long term solution (it's also how the per-bdi
> flushing replacement works). The problem with that appears to be that any
> suggested patchset for thread pools spirals into certain "but what color
> should it be?!" death.
>
> I'll try and work up a simple create_threadpool() implementation; we can
> literally cut away hundreds of threads with that.

Don't we have one already? Dave Howells' slow_work or such?

Ben.
Alan Cox wrote:
> Somewhat simpler for the general case would be to implement
>
> 	create_workqueue(min_threads, max_threads, max_thread_idle_time);
> 	queue_work_within();
>
> at which point you can queue work to the workqueue, but with a
> per-workqueue timer running so you know when you might need to create a
> new thread if the current work hasn't finished. Idle threads would then
> sleep and either expire or pick up new work [...]

Sounds a lot like kernel/slow-work.c / include/linux/slow-work.h to me (as
Ben already pointed out).
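[For reference, the slow-work interface mentioned here boils down to roughly
the usage pattern below. This is a sketch from memory of the 2.6.31-era API;
Documentation/slow-work.txt is authoritative, and the my_* names are made up.]

	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/slow-work.h>

	struct my_poll_job {
		struct slow_work work;
		/* ... driver state ... */
	};

	static int my_job_get_ref(struct slow_work *work)
	{
		/* pin whatever object embeds the work item */
		return 0;
	}

	static void my_job_put_ref(struct slow_work *work)
	{
		/* drop the reference taken in get_ref */
	}

	static void my_job_execute(struct slow_work *work)
	{
		struct my_poll_job *job = container_of(work, struct my_poll_job, work);

		/* long-running, sleep-happy processing goes here */
		(void)job;
	}

	static const struct slow_work_ops my_job_ops = {
		.get_ref	= my_job_get_ref,
		.put_ref	= my_job_put_ref,
		.execute	= my_job_execute,
	};

	static struct my_poll_job job;

	static int __init my_init(void)
	{
		int ret = slow_work_register_user();	/* sizes the shared pool */

		if (ret < 0)
			return ret;
		slow_work_init(&job.work, &my_job_ops);
		return slow_work_enqueue(&job.work);
	}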
Stefan Richter wrote:
> Sounds a lot like kernel/slow-work.c / include/linux/slow-work.h to me
> (as Ben already pointed out).

On the other hand, Jens' new create_lazy_workqueue() from two hours ago
looks very nice in terms of the effort to convert candidate users to it.
Hello, Alan.

Alan Cox wrote:
>> It's not about needing per-cpu binding, but if works can be executed on
>> the same cpu they were issued on, it's almost always beneficial. The only
>> reason why we have single threaded workqueues now is to limit the number
>> of threads.
>
> That would argue very strongly for putting all the logic in one place so
> everything shares queues.

Yes, it does.

>>> Only if you make the default assumed max wait time for the work too low -
>>> it's a tunable behaviour, in fact.
>>
>> If the default workqueue is made to manage concurrency well, most works
>> should be able to just use it, so the queue will contain both long running
>> ones and short running ones, which can disturb the current batch-like
>> processing of the default workqueue, which is assumed to have only short
>> ones.
>
> Not sure why it matters - the short ones will instead end up being
> processed serially, in parallel to the hog.

The problem is how to assign works to workers. With long running works,
the workqueue will definitely need some reserves in the worker pool. When
short works are queued consecutively, without special provision they'll
end up served by different workers, increasing cache footprint and
execution overhead. The special provision could be something timer based,
but modding a timer for each work is a bit expensive. I think it needs to
be more mechanical rather than depend on heuristics or timing.

>> [...] It would be great if a single work API were exported and
>> concurrency were managed automatically so that no one else has to worry
>> about concurrency, but achieving that requires much more intelligence
>> in the workqueue implementation [...]
>
> The more intelligence in the workqueue logic, the less in the drivers, and
> the more it can be adjusted and adapt itself.

Yeap, sure.

> Consider things like power management, which might argue for breaking
> the cpu affinity to avoid waking up a sleeping CPU in preference to
> jumping work between processors.

Yeah, that's one thing to consider too, but works being scheduled on a
particular cpu are usually the result of other activities going on on that
cpu. I don't think the workqueue needs to be modified for that. If other
things move, workqueue will automatically follow.

Thanks.
Benjamin Herrenschmidt wrote:
> On Wed, 2009-08-19 at 14:23 +0200, Jens Axboe wrote:
>> Well, that's the same thread pool suggestion that Jeff came up with. And I
>> agree, that's a nicer long term solution (it's also how the per-bdi
>> flushing replacement works). The problem with that appears to be that any
>> suggested patchset for thread pools spirals into certain "but what color
>> should it be?!" death.
>>
>> I'll try and work up a simple create_threadpool() implementation; we can
>> literally cut away hundreds of threads with that.
>
> Don't we have one already? Dave Howells' slow_work or such?

Yes, it addresses a different aspect of the concurrency problem. Might be
more suitable for the ATA workqueues but definitely more costly to convert
to. Argh...
Tejun Heo wrote:
> Yes, it addresses a different aspect of the concurrency problem. Might be
> more suitable for the ATA workqueues but definitely more costly to convert
> to. Argh...

^ "more costly" compared to Jens's lazy workqueue, that is.
On Thu, 2009-08-20 at 21:48 +0900, Tejun Heo wrote:
> Tejun Heo wrote:
>> Yes, it addresses a different aspect of the concurrency problem. Might be
>> more suitable for the ATA workqueues but definitely more costly to convert
>> to. Argh...
>
> ^ "more costly" compared to Jens's lazy workqueue, that is.

So there are two issues here. One is ATA's need for execution in user
context that won't block other execution ... I really think that if
there's an existing pattern for this in the kernel, we should use it
rather than inventing our own.

The other is the question of whether the workqueue concept itself is
flawed. This business of some jobs blocking other jobs due to execution
order on the queue can be a nasty side effect, and it can lead to
entangled deadlocks, but for some uses the whole concept of queued jobs
following a set order is necessary. It might be appropriate to think about
whether we want to convert the whole workqueue infrastructure to something
like slow_work instead, and possibly think about ordering on top of this.

	James
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 072ba5e..0d78628 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6580,7 +6580,7 @@ static int __init ata_init(void)
 {
 	ata_parse_force_param();
 
-	ata_wq = create_workqueue("ata");
+	ata_wq = create_singlethread_workqueue("ata");
 	if (!ata_wq)
 		goto free_force_tbl;
Hi,

On boxes with lots of CPUs, we have so many kernel threads it's not funny.
The basic problem is that create_workqueue() creates a per-cpu thread,
where we could easily get by with a single thread for a lot of cases.

One such case appears to be ata_wq. You want at most one per pio drive,
not one per CPU. I'd suggest just dropping it to a single threaded wq.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>